Thread: Scaling further up

Scaling further up

From
"Anjan Dave"
Date:
All:
 
We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
 
We are expecting a pretty high load, a few thousand 'concurrent' users executing select, insert, or update statements.
 
What is the next step up in terms of handling very heavy loads? Clustering?
 
Are there any standard, recommended clustering options?
 
 
Also, in terms of hardware, which would benefit us more overall: a SunFire 880 type of machine (6 or 8 CPUs, lots of RAM, internal FC drives), or an IA-64 architecture?
 
Appreciate any inputs,
 
Thanks,
Anjan


 

Re: Scaling further up

From
"Magnus Hagander"
Date:
> All:
>
> We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running
> RH9, PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
>
> We are expecting a pretty high load, a few thousand 'concurrent'
> users executing select, insert, or update statements.
>
> What is the next step up in terms of handling very heavy loads?
> Clustering?

I'd look at adding more disks first. Depending on what type of query
load you get, that box sounds like it will be very much I/O bound. More
spindles = more parallel operations = faster under load. Consider
adding 15KRPM disks as well; they're not all that much more expensive,
and should give you better performance than 10KRPM.

Also, make sure you put your WAL disks on a separate RAIDset if possible
(not just a separate partition on existing RAIDset).
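
For example, on a 7.4 cluster the WAL can be moved onto that dedicated
RAIDset with a symlink once it is mounted. A minimal sketch, assuming the
data directory is /var/lib/pgsql/data and the new RAIDset is mounted at
/mnt/wal (stop the server first):

  # stop the server before touching pg_xlog
  pg_ctl -D /var/lib/pgsql/data stop
  # move the WAL directory to the dedicated RAIDset and leave a symlink behind
  mv /var/lib/pgsql/data/pg_xlog /mnt/wal/pg_xlog
  ln -s /mnt/wal/pg_xlog /var/lib/pgsql/data/pg_xlog
  pg_ctl -D /var/lib/pgsql/data start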

Finally, if you don't already have it, look for a battery-backed RAID
controller that can do write-back caching, and enable that. (Don't even
think about enabling it unless it's battery-backed!) And add as much RAM
as you can to that controller.


//Magnus

Re: Scaling further up

From
"Anjan Dave"
Date:
For the disks part - I am looking at a SAN implementation, and I will be
planning a separate RAID group for the WALs.

The controller is a PERC, with 128MB cache, and I think it is write-back.

Other than the disks, I am curious what other people are using in terms
of the horsepower needed. The Quad server has been keeping up, but we
are expecting quite high loads in the near future, and I am not sure if
just by having the disks on a high-end storage will do it.

Thanks,
Anjan



Re: Scaling further up

From
Josh Berkus
Date:
Anjan,

> Other than the disks, I am curious what other people are using in terms
> of the horsepower needed. The Quad server has been keeping up, but we
> are expecting quite high loads in the near future, and I am not sure if
> just by having the disks on a high-end storage will do it.

Do a performance analysis of RH9.  My experience with RH on Xeon has been
quite discouraging lately, and I've been recommending swapping stock kernels
for the RH kernel.

Of course, if this is RHES, rather than the standard, then test & talk to RH
instead.
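
A quick first pass at that analysis can be done with the stock tools
before digging deeper; a sketch (the intervals are arbitrary):

  uname -r                 # confirm which kernel is actually running
  vmstat 5 12              # run queue, swap activity, and I/O wait over a minute
  iostat -x -k 5 12        # per-device utilization and average wait times
  top -b -n 1 | head -20   # snapshot of the busiest processes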

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Scaling further up

From
Christopher Browne
Date:
After a long battle with technology, josh@agliodbs.com (Josh Berkus), an earthling, wrote:
>> Other than the disks, I am curious what other people are using in
>> terms of the horsepower needed. The Quad server has been keeping
>> up, but we are expecting quite high loads in the near future, and I
>> am not sure if just by having the disks on a high-end storage will
>> do it.
>
> Do a performance analysis of RH9.  My experience with RH on Xeon has
> been quite discouraging lately, and I've been recommending swapping
> stock kernels for the RH kernel.

By that, you mean that you recommend that RHAT kernels be replaced by
"stock" ones?

> Of course, if this is RHES, rather than the standard, then test &
> talk to RH instead.

If you're spending the money, better demand value from the vendor...

(And if RHAT is going to charge the big bucks, they'll have to provide
service...)
--
(reverse (concatenate 'string "gro.gultn" "@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/rdbms.html
"I take it all back. Microsoft Exchange is RFC compliant.
RFC 1925, point three." -- Author unknown

Re: Scaling further up

From
William Yu
Date:
Anjan Dave wrote:
> We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9,
> PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
>
> We are expecting a pretty high load, a few thousand 'concurrent'
> users executing select, insert, or update statements.

The quick and dirty method would be to upgrade to the recently announced
3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get
another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.

Re: Scaling further up

From
Fred Moyer
Date:
On Tue, 2004-03-02 at 17:42, William Yu wrote:
> Anjan Dave wrote:
> > We have a Quad-Intel XEON 2.0GHz (1MB cache), 12GB memory, running RH9,
> > PG 7.4.0. There's an internal U320, 10K RPM RAID-10 setup on 4 drives.
> >
> > We are expecting a pretty high load, a few thousand 'concurrent'
> > users executing select, insert, or update statements.
>
> The quick and dirty method would be to upgrade to the recently announced
> 3GHz Xeon MPs with 4MB of L3. My semi-educated guess is that you'd get
> another +60% there due to the huge L3 hiding the Xeon's shared bus penalty.

If you are going to have thousands of 'concurrent' users you should
seriously consider the 2.6 kernel if you are running Linux or as an
alternative going with FreeBSD.  You will need to load test your system
and become an expert on tuning Postgres to get the absolute maximum
performance from each and every query you have.

And you will need lots of hard drives.  By lots I mean dozen(s) in a
raid 10 array with a good controller.  Thousands of concurrent users
means hundreds or thousands of transactions per second.  I've personally
seen it scale that far but in my opinion you will need a lot more hard
drives and ram than cpu.
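
For the load-testing part, the pgbench tool in contrib is one way to get a
rough ceiling before your own application-level tests; a sketch, assuming a
throwaway database and made-up scale/client numbers (your real query mix
matters far more than pgbench's TPC-B-style mix):

  createdb pgbench_test
  pgbench -i -s 100 pgbench_test        # initialize at scale factor 100 (~10M account rows)
  pgbench -c 100 -t 1000 pgbench_test   # 100 concurrent clients, 1000 transactions each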


Re: Scaling further up

From
"Anjan Dave"
Date:
"By lots I mean dozen(s) in a raid 10 array with a good controller."

I believe that for RAID-10 I will need an even number of drives. Currently,
the size of the database is about 13GB, and it is not expected to grow
exponentially with thousands of concurrent users, so total space is not
of paramount importance compared to performance.

Does this sound like a reasonable setup?
10x36GB FC drives on RAID-10
4x36GB FC drives for the logs on RAID-10 (not sure if this is the
correct ratio)?
1 hotspare
Total=15 Drives per enclosure.

Tentatively, I am looking at an entry-level EMC CX300 product with 2GB
RAID cache, etc.

Question - Are 73GB drives supposed to give better performance because
of a higher number of platters?

Thanks,
Anjan




Re: Scaling further up

From
Chris Ruprecht
Date:
Hi all,

If you have a DB of 'only' 13 GB and you do not expect it to grow much, it
might be advisable to have enough memory (RAM) to hold the entire DB in
shared memory (everything is cached). If you have a server with, say, 24 GB
of memory and can allocate 20 GB for cache, you don't care about the speed
of disks any more - all you worry about is the speed of your memory and your
network connection.
I believe this is not possible using 32-bit technology; you would have to go
to some 64-bit platform, but if it's speed you want ...
You can also try solid state hard disk drives. These are actually just
memory - there are no moving parts - but they look and behave like very,
very fast disk drives. I have seen them at capacities of 73 GB, but they
didn't mention the price (I'd probably have a heart attack when I look at
the price tag).
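
A quick sanity check of whether the cluster would actually fit in RAM; a
sketch, where the data directory path is an assumption:

  du -sh /var/lib/pgsql/data/base   # on-disk size of all databases in the cluster
  free -m                           # total RAM and how much is currently sitting in OS cache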

Best regards,
Chris




Re: Scaling further up

From
Josh Berkus
Date:
Anjan,

> Question - Are 73GB drives supposed to give better performance because
> of higher number of platters?

Not for your situation, no.   Your issue is random seek times for multiple
simultaneous seek requests and batched checkpoint updates.   Things that help
with this are:
More spindles
Better controllers, both RAID and individual disks
Faster drives

Particularly, I'd check out stuff like reports from Tom's Hardware for
evaluating the real speed of drives and seek times.   Often a really good
10000 RPM SCSI will beat a 15000RPM SCSI if the latter has poor onboard
programming.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Scaling further up

From
"scott.marlowe"
Date:
On Tue, 2 Mar 2004, Anjan Dave wrote:

> "By lots I mean dozen(s) in a raid 10 array with a good controller."
>
> I believe, for RAID-10, I will need even number of drives.

Correct.

> Currently,
> the size of the database is about 13GB, and is not expected to grow
> exponentially with thousands of concurrent users, so total space is not
> of paramount importance compared to performance.
>
> Does this sound reasonable setup?
> 10x36GB FC drives on RAID-10
> 4x36GB FC drives for the logs on RAID-10 (not sure if this is the
> correct ratio)?
> 1 hotspare
> Total=15 Drives per enclosure.

Putting the logs on RAID-10 is likely to be slower than, or no faster than,
putting them on RAID-1, since the RAID-10 will have to write to 4 drives,
while the RAID-1 will only have to write to two drives.  Now, if you were
reading in the logs a lot, it might help to have the RAID-10.

> Tentatively, I am looking at an entry-level EMC CX300 product with 2GB
> RAID cache, etc.

Pick up a spare, I'll get you my home address, etc... :-)

Seriously, that's huge.  At that point you may well find that putting
EVERYTHING on a big old RAID-5 performs best, since you've got lots of
caching / write buffering going on.

> Question - Are 73GB drives supposed to give better performance because
> of higher number of platters?

Generally, larger hard drives perform better than smaller hard drives
because they a: have more heads and / or b: have a higher areal density.

It's a common misconception that faster RPM drives are a lot faster, when,
in fact, their only speed advantage is slightly faster seeks.  The areal
density of faster spinning hard drives tends to be somewhat less than the
slower spinning drives, since the maximum frequency the heads can work at,
on both drives, assuming the same technology, is the same.  I.e. the speed
at which you can read data off of the platter doesn't usually go up with a
higher RPM drive, only the speed with which you can get to the first
sector.


Re: Scaling further up

From
"Anjan Dave"
Date:
That was part of my original question - whether it makes sense to go for
a mid-range SunFire machine (64bit HW, 64bit OS), which is scalable to
high amounts of memory, and shouldn't have any issues addressing it all.
I've had that kind of setup once temporarily on a V480 (quad UltraSparc,
16GB RAM) machine, and it did well in production use. Without having the
time/resources to do extensive testing,  I am not sure if
Postgres/Solaris9 is really suggested by the community for
high-performance, as opposed to a XEON/Linux setup. Storage being a
separate discussion.

Thanks,
Anjan



Re: Scaling further up

From
"Anjan Dave"
Date:
Here's what I recorded today from iostat (Linux, iostat -x -k; sda3 is
the pg slice, logs included) during peak time on the RAID-10 array.
What I see is mostly writes, and sometimes quite a bit of writing,
during which the average wait times shoot up.

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
/dev/sda3   18.81 113.21  3.90 36.33  181.54 1207.75    90.77   603.88 34.54     0.49    0.73   0.22   0.87
/dev/sda3    0.00 208.00  0.00 150.00    0.00 2884.00     0.00  1442.00 19.23     0.75    0.50   0.33   5.00
/dev/sda3    0.00 239.00  0.00 169.00    0.00 3264.00     0.00  1632.00 19.31     2.15    1.27   0.33   5.50
/dev/sda3    0.00 224.50  0.00 158.00    0.00 3060.00     0.00  1530.00 19.37     1.90    1.20   0.28   4.50
/dev/sda3    0.00 157.00  0.00 117.00    0.00 2192.00     0.00  1096.00 18.74     0.40    0.34   0.30   3.50
/dev/sda3    0.00 249.50  0.00 179.00    0.00 3596.00     0.00  1798.00 20.09    21.40   10.78   0.39   7.00
/dev/sda3    0.00 637.50  0.00 620.50    0.00 9936.00     0.00  4968.00 16.01  1137.15  183.55   1.85 115.00
/dev/sda3    0.00 690.00  0.00 548.50    0.00 9924.00     0.00  4962.00 18.09    43.10    7.82   0.46  25.50
/dev/sda3    0.00 485.00  0.00 392.00    0.00 7028.00     0.00  3514.00 17.93    86.90   22.21   1.14  44.50
/dev/sda3    0.00 312.50  0.00 206.50    0.00 4156.00     0.00  2078.00 20.13     3.50    1.69   0.53  11.00
/dev/sda3    0.00 386.50  0.00 275.50    0.00 5336.00     0.00  2668.00 19.37    16.80    6.10   0.60  16.50
/dev/sda3    0.00 259.00  0.00 176.50    0.00 3492.00     0.00  1746.00 19.78     3.25    1.84   0.40   7.00
/dev/sda3    0.00 196.00  0.00 99.00    0.00 2360.00     0.00  1180.00 23.84     0.10    0.10   0.10   1.00
/dev/sda3    0.00 147.00  0.00 100.00    0.00 1976.00     0.00   988.00 19.76     0.50    0.50   0.45   4.50
/dev/sda3    0.00 126.50  0.00 94.50    0.00 1768.00     0.00   884.00 18.71     0.20    0.21   0.21   2.00
/dev/sda3    0.00 133.50  0.00 106.50    0.00 1920.00     0.00   960.00 18.03     0.50    0.47   0.47   5.00
/dev/sda3    0.00 146.50  0.00 118.00    0.00 2116.00     0.00  1058.00 17.93     0.20    0.21   0.17   2.00
/dev/sda3    0.00 156.00  0.00 128.50    0.00 2276.00     0.00  1138.00 17.71     0.35    0.27   0.27   3.50
/dev/sda3    0.00 145.00  0.00 105.00    0.00 2000.00     0.00  1000.00 19.05     0.25    0.24   0.24   2.50
/dev/sda3    0.00  72.96  0.00 54.51    0.00 1019.74     0.00   509.87 18.71     0.17    0.31   0.31   1.72
/dev/sda3    0.00 168.50  0.00 139.50    0.00 2464.00     0.00  1232.00 17.66     0.65    0.47   0.39   5.50
/dev/sda3    0.00 130.50  0.00 100.00    0.00 1844.00     0.00   922.00 18.44     0.00    0.00   0.00   0.00
/dev/sda3    0.00 122.00  0.00 101.00    0.00 1784.00     0.00   892.00 17.66     0.25    0.25   0.25   2.50
/dev/sda3    0.00 143.00  0.00 121.50    0.00 2116.00     0.00  1058.00 17.42     0.25    0.21   0.21   2.50
/dev/sda3    0.00 134.50  0.00 96.50    0.00 1848.00     0.00   924.00 19.15     0.35    0.36   0.36   3.50
/dev/sda3    0.00 153.50  0.00 115.00    0.00 2148.00     0.00  1074.00 18.68     0.35    0.30   0.30   3.50
/dev/sda3    0.00 101.50  0.00 80.00    0.00 1452.00     0.00   726.00 18.15     0.20    0.25   0.25   2.00
/dev/sda3    0.00 108.50  0.00 92.50    0.00 1608.00     0.00   804.00 17.38     0.25    0.27   0.27   2.50
/dev/sda3    0.00 179.00  0.00 132.50    0.00 2492.00     0.00  1246.00 18.81     0.55    0.42   0.42   5.50
/dev/sda3    1.00 113.00  1.00 83.00   16.00 1568.00     8.00   784.00 18.86     0.15    0.18   0.12   1.00
/dev/sda3    0.00 117.00  0.00 97.50    0.00 1716.00     0.00   858.00 17.60     0.20    0.21   0.21   2.00
/dev/sda3    0.00 541.00  0.00 415.50    0.00 7696.00     0.00  3848.00 18.52   146.50   35.09   1.37  57.00
/dev/sda3    0.00 535.00  0.00 392.50    0.00 7404.00     0.00  3702.00 18.86   123.70   31.67   1.31  51.50
/dev/sda3    0.00 993.50  0.00 697.50    0.00 13544.00     0.00  6772.00 19.42   174.25   24.98   1.25  87.00
/dev/sda3    0.00 245.00  0.00 108.50    0.00 2832.00     0.00  1416.00 26.10     0.55    0.51   0.51   5.50



Re: Scaling further up

From
"scott.marlowe"
Date:
On Tue, 2 Mar 2004, Anjan Dave wrote:

> That was part of my original question - whether it makes sense to go for
> a mid-range SunFire machine (64bit HW, 64bit OS), which is scalable to
> high amounts of memory, and shouldn't have any issues addressing it all.
> I've had that kind of setup once temporarily on a V480 (quad UltraSparc,
> 16GB RAM) machine, and it did well in production use. Without having the
> time/resources to do extensive testing,  I am not sure if
> Postgres/Solaris9 is really suggested by the community for
> high-performance, as opposed to a XEON/Linux setup. Storage being a
> separate discussion.

Some folks on the list have experience with Postgresql on Solaris, and
they generally say they use Solaris not for performance reasons, but for
reliability reasons.  I.e. the bigger Sun hardware is fault tolerant.

For speed, the x86 32 and 64 bit architectures seem to be noticeably
faster than Sparc.  However, running Linux or BSD on Sparc makes them
pretty fast too, but you lose the fault-tolerant support for things like
hot-swappable CPUs or memory.



Re: Scaling further up

From
Rod Taylor
Date:
> For speed, the X86 32 and 64 bit architectures seem to be noticeable
> faster than Sparc.  However, running Linux or BSD on Sparc make them
> pretty fast too, but you lose the fault tolerant support for things like
> hot swappable CPUs or memory.

Agreed. You can get a quad Opteron with 16GB memory for around 20K.

Grab 3, a cheap SAN, and set up a little master/slave replication with
failover (how is Slony coming?), and you're all set.



Re: Scaling further up

From
johnnnnnn
Date:
On Tue, Mar 02, 2004 at 02:16:24PM -0700, scott.marlowe wrote:
> It's a common misconception that faster RPM drives are a lot faster,
> when, in fact, their only speed advantage is slight faster seeks.
> The areal density of faster spinning hard drives tends to be
> somewhat less than the slower spinning drives, since the maximum
> frequency the heads can work in on both drives, assuming the same
> technology, is the same.  I.e. the speed at which you can read data
> off of the platter doesn't usually go up with a higher RPM drive,
> only the speed with which you can get to the first sector.

This would imply that an upgrade in drive RPM should be accompanied by
a decrease in random_page_cost, correct?

random_page_cost should be set with the following things taken into
account:
  - seek speed
  - likelihood of page to be cached in memory by the kernel
  - anything else?


Sorry, i realize this pulls the thread a bit off-topic, but i've heard
that about RPM speeds before, and i just want some confirmation that
my deductions are reasonable.

-johnnnnnnnnnnn

Re: Scaling further up

From
"Anjan Dave"
Date:
Can you describe the vendors/components of a "cheap SAN setup?"

Thanks,
Anjan




Re: Scaling further up

From
Jeff
Date:
On Mar 2, 2004, at 5:36 PM, scott.marlowe wrote:

> Some folks on the list have experience with Postgresql on Solaris, and
> they generally say they use Solaris not for performance reasons, but
> for
> reliability reasons.  I.e. the bigger Sun hardware is fault tolerant.
>
Solaris isn't nearly as bad for PG as it used to be.

But as you say - the #1 reason to use sun is reliability. (In my case,
it was because we had a giant sun laying around :)

I'm trying to remember exactly what happens... but I know on Sun, if it
has a severe memory error it kills off processes with data on that DIMM
(since it has no idea whether the data is bad or not; thanks to ECC this
is very rare, but it can happen). I want to say that if a CPU dies, any
processes running on it at that moment are also killed, but the more I
think about it, the more I don't think that is the case.

As for x86.. if ram or a cpu goes bad you're SOL.

Although Opterons are sexy, you need to remember they really are brand
new CPUs - I'm sure AMD has done tons of testing, but Sun UltraSparcs
have been in every situation conceivable in production.  If you are really
going to have thousands of users, you probably want to bet the farm on
something proven.

lots and lots of spindles
lots and lots of ram

You may also want to look into a replication solution as a hot backup.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: Scaling further up

From
Rod Taylor
Date:
On Tue, 2004-03-02 at 18:24, Anjan Dave wrote:
> Can you describe the vendors/components of a "cheap SAN setup?"

heh.. Excellent point.

My point was that you can get away with a smaller setup (fewer disks) if
it doesn't have to handle many reads and the writes are not time-dependent,
compared to what you will need if you attempt to pull 500MB/sec off the
disks.

If it is foreseeable that the database can be held in RAM, that is a much
easier and cheaper way to get high I/O than with physical disks.



Re: Scaling further up

From
Paul Thomas
Date:
On 02/03/2004 23:25 johnnnnnn wrote:
> [snip]
> random_page_cost should be set with the following things taken into
> account:
>   - seek speed

Which is not exactly the same thing as spindle speed as it's a combination
of spindle speed and track-to-track speed. I think you'll find that a 15K
rpm disk, whilst it will probably have a lower seek time than a 10K rpm
disk, won't have a proportionately (i.e., 2/3rds) lower seek time.

>   - likelihood of page to be cached in memory by the kernel

That's effective cache size.


--
Paul Thomas
+------------------------------+---------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for the Smaller Business |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk   |
+------------------------------+---------------------------------------------+

Re: Scaling further up

From
Andrew Sullivan
Date:
On Tue, Mar 02, 2004 at 04:50:04PM -0500, Anjan Dave wrote:
> time/resources to do extensive testing,  I am not sure if
> Postgres/Solaris9 is really suggested by the community for
> high-performance, as opposed to a XEON/Linux setup. Storage being a
> separate discussion.

I can tell you from experience that performance on Solaris is nowhere
close to what you'd expect, given the coin you're forking over for
it.  I think the reason to use Solaris is its support for all the
nifty hot-swappable hardware, and not for its speed or any putative
benefit you might get from having 64 bits at your disposal.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
The fact that technology doesn't work is no bar to success in the marketplace.
        --Philip Greenspun

Re: Scaling further up

From
Josh Berkus
Date:
John,

> This would imply that an upgrade in drive RPM should be accompanied by
> a decrease in random_page_cost, correct?

Maybe.  Maybe not.   Tom's Hardware did some Bonnie++ testing with a variety
of new drives last year.   They were moderately surprised to find that there
were "faster" drives (i.e. higher RPM) which had lower real throughput due to
poor onboard software and hardware, such as a small and slow onboard cache.

So, it would be reasonable to assume that a 10,000 RPM Barracuda could support
marginally lower random_page_cost than a 7,200 RPM Barracuda ... but that
tells you nothing about a 10,000 RPM Maxtor Diamond (as an example).

Also, many other factors influence real random_page_cost; the size and access
pattern of your database is probably much more important than your RPM.
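
If you want comparable numbers for your own array rather than a published
review, a bonnie++ run against the RAID volume is one option; a sketch,
with a made-up path, run as root, and a file size of at least twice RAM so
the OS cache doesn't mask the drives:

  bonnie++ -d /mnt/raid10/bonnie -s 24576 -u postgres   # 24GB test file, run as the postgres user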

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Scaling further up

From
"scott.marlowe"
Date:
On Wed, 3 Mar 2004, Paul Thomas wrote:

>
> On 02/03/2004 23:25 johnnnnnn wrote:
> > [snip]
> > random_page_cost should be set with the following things taken into
> > account:
> >   - seek speed
>
> Which is not exactly the same thing as spindle speed as it's a combination
> of spindle speed and track-to-track speed. I think you'll find that a 15K
> rpm disk, whilst it will probably have a lower seek time than a 10K rpm
> disk, won't have a proportionately (i.e., 2/3rds) lower seek time.

There are three factors that affect how fast you can get to the next
sector:

seek time
settle time
rotational latency

Most drives only list the first, and don't bother to mention the other
two.

On many modern drives, the seek times are around 5 to 10 milliseconds.
The settle time varies as well; the longer the seek, the longer the
settle, generally.  This is the time it takes for the head to stop shaking
and rest quietly over a particular track.
Rotational latency is the amount of time you have to wait, on average, for
the sector you want to come under the heads.

Assuming an 8 ms seek, and 2 ms settle (typical numbers), and that the
rotational latency on average is 1/2 of a rotation:  At 10k rpm, a
rotation takes 1/166.667 of a second, or 6 mS.  So, a half a rotation is
approximately 3 mS.  By going to a 15k rpm drive, the latency drops to 2
mS.  So, if we add them up, on the same basic drive, one being 10k and one
being 15k, we get:

10krpm: 8+2+3 = 13 mS
15krpm: 8+2+2 = 12 mS

So, based on the decrease in rotational latency being the only advantage
the 15krpm drive has over the 10krpm drive, we get a decrease in access
time of only 1 mS, or only about an 8% decrease in actual seek time.

So, if your random_page_cost on 10krpm drives was 1.7, you'd need to
drop it to 1.57 or so to reflect the speed increase from 15krpm drives.

I.e. it's much more likely that going from 1 gig to 2 gigs of ram will
make a noticeable difference than going from 10k to 15k drives.
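
If you want to see whether a change that small even alters any plans, it
can be tried per-session before touching postgresql.conf; a sketch, with a
made-up database, table, and column:

  psql -d mydb -c "SET random_page_cost = 1.57; EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;"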


Re: Scaling further up

From
Paul Thomas
Date:
On 03/03/2004 18:23 scott.marlowe wrote:
> [snip]
> There are three factors that affect how fast you can get to the next
> sector:
>
> seek time
> settle time
> rotational latency
>
> Most drives only list the first, and don't bother to mention the other
> two.

Ah yes, one of my (very) few still functioning brain cells was nagging
about another bit of time in the equation :)

> On many modern drives, the seek times are around 5 to 10 milliseconds.
> [snip]

Going back to the OPs posting about random_page_cost, imagine I have 2
servers identical in every way except the disk drive. Server A has a 10K
rpm drive and server B has a 15K rpm drive. Seek/settle times aren't
spectacularly different between the 2 drives. I'm wondering if drive B
might actually merit a _higher_ random_page_cost than drive A as, once it
gets settled on a disk track, it can suck the data off a lot faster.
opinions/experiences anyone?


--
Paul Thomas
+------------------------------+---------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for the Smaller Business |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk   |
+------------------------------+---------------------------------------------+

Re: Scaling further up

From
"Anjan Dave"
Date:
Great response, Thanks.

Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't
understand is that even though the OS uses most of the memory as cache
and PG can use it if it needs it, why would the system swap (not much,
only during peak times)? SHMMAX is set to 512MB, shared_buffers is 150MB,
effective_cache_size is 2GB, sort_mem is 2MB, and the rest are default
values. It also happens that a large query (reporting type) can hold up
the other queries, and the load averages shoot up during peak times.

Regarding a baseline -

-We have docs and monitor for frequency of sql statements, most
expensive ones, etc. (IronEye)
-I am monitoring disk reads/writes using iostat
-How do I measure commit frequency, and system events like checkpoints?
(vacuum is done nightly during low or no load)

Thanks,
Anjan



Re: Scaling further up

From
"scott.marlowe"
Date:
On Thu, 4 Mar 2004, Paul Thomas wrote:

> On 03/03/2004 18:23 scott.marlowe wrote:
> > [snip]
> > There are three factors that affect how fast you can get to the next
> > sector:
> >
> > seek time
> > settle time
> > rotational latency
> >
> > Most drives only list the first, and don't bother to mention the other
> > two.
>
> Ah yes, one of my (very) few still functioning brain cells was nagging
> about another bit of time in the equation :)
>
> > On many modern drives, the seek times are around 5 to 10 milliseconds.
> > [snip]
>
> Going back to the OPs posting about random_page_cost, imagine I have 2
> servers identical in every way except the disk drive. Server A has a 10K
> rpm drive and server B has a 15K rpm drive. Seek/settle times aren't
> spectacularly different between the 2 drives. I'm wondering if drive B
> might actually merit a _higher_ random_page_cost than drive A as, once it
> gets settled on a disk track, it can suck the data off a lot faster.
> opinions/experiences anyone?

It might well be that you have higher settle times that offset the small
gain in rotational latency.  I haven't looked into it, so I don't know one
way or the other, but it seems a reasonable assumption.

However, a common misconception is that the higher angular velocity of
the 15krpm drives would allow you to read data faster.  In fact, the limit
of how fast you can read is set by the head.  There's a maximum frequency
that it can read, and the areal density / rpm have to be such that you
don't exceed that frequency.  Often, the speed at which you read off the
platters is exactly the same between a 10k and a 15k of the same family.

The required lower areal density is the reason 15krpm drives show up in
the lower capacities first.


Re: Scaling further up

From
William Yu
Date:
Anjan Dave wrote:
> Great response, Thanks.
>
> Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't
> understand is that even though the OS caches most of the memory and PG
> can use it if it needs it, why would the system swap (not much, only
> during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB,
> effective cache size is 2GB, sort mem is 2MB, rest is default values. It
> also happens that a large query (reporting type) can hold up the other
> queries, and the load averages shoot up during peak times.

In regards to your system going to swap, the only item I see is sort_mem
at 2MB. How many simultaneous transactions do you get? If you get
hundreds or thousands like your first message stated, every select sort
would take up 2MB of memory regardless of whether it needed it or not.
That could cause your swap activity during peak traffic.

The only other item to bump up is the effective cache size -- I'd set it
to 12GB.
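
For reference, on 7.4 those two knobs live in postgresql.conf and use
old-style units - effective_cache_size in 8kB pages and sort_mem in kB; a
sketch of checking the current values and what the suggested change would
look like (the database name and exact numbers are assumptions):

  psql -d mydb -c "SHOW effective_cache_size"   # current value, in 8kB pages
  psql -d mydb -c "SHOW sort_mem"               # current value, in kB
  # in postgresql.conf (then reload):
  #   effective_cache_size = 1572864            # ~12GB of expected OS cache, in 8kB pages
  #   sort_mem = 2048                           # 2MB per sort; scales with concurrent sorts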


Re: Scaling further up

From
Aaron W
Date:
....
I'd look at adding more disks first. Depending on what
type of query
load you get, that box sounds like it will be very
much I/O bound....

Given a 13G database on a 12G system, with a low
growth rate, it is likely that there is almost no I/O
for most activities. The exception is checkpointing.
The first thing I'd do is try to build a spreadsheet
model of:
- select frequency, and # logical and physical reads
involved
- insert/delete/update frequency, and # logical and
physical read and writes involved
- commit frequency, etc.
(start out with simplistic assumptions, and do it for
peak load)
- system events (checkpoints, vacuum)
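
For the commit-frequency line, the stats collector already counts commits
per database, so sampling it over an interval gives transactions per
second; a sketch, assuming the stats collector is enabled and a database
named mydb:

  psql -d mydb -t -c "SELECT xact_commit FROM pg_stat_database WHERE datname = 'mydb'"
  sleep 60
  psql -d mydb -t -c "SELECT xact_commit FROM pg_stat_database WHERE datname = 'mydb'"
  # (second value - first value) / 60 = commits per second over the interval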

I assume that the only high I/O you will see will be
for logging. The RAID issue there is basically
obviated by the sequential write nature of WAL. If
that is the case, EMC is not the most cost effective
or obvious  solution - since the value they provide is
mostly manageability for disaster recovery. The goal
in this case is to write at the application max speed,
and with minimal latency. Any responsible battery-backed,
write-through (mirrored) cached controller can do that for you.

On the other hand, if your requests are not *all*
trivial, you are going to test the hardware and
scheduling algorithms of OS and pg. Even if 0.1% of
3,000 tps take a second - that ends up generating 3
seconds of load.... Any, even slightly, slow
transactions will generate enormous queues which slow
down everything.

In most systems of this volume I've seen, the mix of
activities is constantly invalidating cache, making L2
caching less important. Memory to CPU bus speed is a
limiting factor, as well as raw CPU speed in
processing the requests. Xeon is not a great
architecture for this because of FSB contention; I
suspect a 4-way will be completely FSB bottlenecked so
a more than 4 way would likely not change performance.


I would try to get a simple model/benchmark going and
test against it. You should be talking to the big iron
vendors for their take on your issues and get their
capacity benchmarks.


Re: Scaling further up

From
Robert Treat
Date:
On Mon, 2004-03-08 at 11:40, William Yu wrote:
> Anjan Dave wrote:
> > Great response, Thanks.
> >
> > Regarding 12GB memory and 13G db, and almost no I/O, one thing I don't
> > understand is that even though the OS caches most of the memory and PG
> > can use it if it needs it, why would the system swap (not much, only
> > during peak times)? The SHMMAX is set to 512MB, shared_buffers is 150MB,
> > effective cache size is 2GB, sort mem is 2MB, rest is default values. It
> > also happens that a large query (reporting type) can hold up the other
> > queries, and the load averages shoot up during peak times.
>
> In regards to your system going to swap, the only item I see is sort_mem
> at 2MB. How many simultaneous transactions do you get? If you get
> hundreds or thousands like your first message stated, every select sort
> would take up 2MB of memory regardless of whether it needed it or not.
> That could cause your swap activity during peak traffic.
>
> The only other item to bump up is the effective cache size -- I'd set it
> to 12GB.
>

Was surprised that no one corrected this bit of erroneous info (or at
least I didn't see it), so thought I would for completeness. A basic
explanation is that sort_mem controls how much memory a given query is
allowed to use for a sort before spilling to disk, but it will not grab
that much memory if it doesn't need it.

See the docs for a more detailed explanation:
http://www.postgresql.org/docs/7.4/interactive/runtime-config.html#RUNTIME-CONFIG-RESOURCE

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL


Re: Scaling further up

From
"Anjan Dave"
Date:
We upgraded from 8GB to 12GB RAM a month or so ago, but even in the
past I've never seen the system exhaust its system cache (~6GB, in
'top') while it's swapping.

Someone had mentioned why not have the entire DB in memory? How do I
configure that, for knowledge?

Max connections is set to 500, and we haven't bumped it yet. (I've seen
over 200 active queries, but the traffic is seasonal, hence the high
connection value.)

Thanks,
Anjan


Re: Scaling further up

From
Marty Scholes
Date:
I have some suggestions based on my anecdotal experience.

1. This is a relatively small DB -- the working set will likely be in
RAM at any moment in time, making read I/O time mostly irrelevant.

2. The killer will be write times -- specifically log writes.  Small and
heavily synchronized writes, log and data writes, will drag down an
impressive hardware RAID setup.  We run mirrored hardware RAID 5 arrays
with write back cache and are constantly seeking ways to improve write
performance.  We do a lot of batch processing, though, so we do a lot of
write I/Os.

3. Be very careful with "battery backed write cache."  It usually works
as advertised.  More than once in the past decade I have seen
spontaneous cache corruption after power loss.  The corruption usually
happens when some admin, including me, has assumed that the cache will
ALWAYS survive a power failure unblemished and has no "plan B."  Make
sure you have a contingency plan for corruption, or don't enable the cache.

4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of
the setup, and might hinder, not help small write I/O performance.

5. Most (almost all) of the I/O time will be due to the access time
(head seek + head settle + rotational latency) and very little of the
I/O time will due to data transfer time.  In other words, getting drives
that provide faster transfer rates will barely improve performance.  The
secret is lowering the access time.

6. A relatively cheap way to drastically drop the access time is to get
large drive(s) and only use a portion of them for storage.  The less
space used on the drive, the less area the heads need to cover for
seeks.  At one extreme, you could make the partition the size of a
single cylinder.  This would make access time (ignoring OS and
controller overhead) identical to rotational latency, which is as low as
4.2 ms for a cheap 7200 RPM drive.

7. A drive with a 5 ms average service time, servicing 8 KB blocks, will
yield as much as 1.6 MB/s sustained write throughput.  Not bad for a
cheap uncached solution.  Any OS aggregation of writes during the
fsync() call will further improve this number -- it is basically a lower
bound for throughput.

8. Many people, especially managers, cannot stomach buying disk space
and only using a portion of it.  In many cases, it seems more palatable
to purchase a much more expensive solution to get to the same speeds.

Good luck.




Re: Scaling further up

From
"Aaron Werman"
Date:
Sorry about not chiming in before - I've been too swamped to think. I agree
with most of the points, but a lot of these posts are interesting and seem
to describe systems from an SA perspective to my DBA-centric view.

----- Original Message -----
From: "Marty Scholes" <marty@outputservices.com>
To: <pgsql-performance@postgresql.org>
Sent: Wednesday, March 10, 2004 6:29 PM
Subject: Re: [PERFORM] Scaling further up


> I have some suggestions based on my anecdotal experience.
>
> 1. This is a relatively small DB -- the working set will likely be in
> RAM at any moment in time, making read I/O time mostly irrelevant.
>
> 2. The killer will be write times -- specifically log writes.  Small and
> heavily synchronized writes, log and data writes, will drag down an
> impressive hardware RAID setup.  We run mirrored hardware RAID 5 arrays
> with write back cache and are constantly seeking ways to improve write
> performance.  We do a lot of batch processing, though, so we do a lot of
> write I/Os.

My experience with RAID5 for streaming sequential writes is bad. This is
sometimes helped by the hardware caching to cover the cost of the additional
I/Os for striping (write through RAID5 + big cache acts like RAID 1+0 until
you run out of cache). Batch processing is different from high concurrency
transactions because it needs faster volume streaming, while TP is dependent
on the speed of ack'ing (few big writes with less synchronous waits vs. lots
of small writes which serialize everyone). (RAID 3 worked for me in the past
for logging, but I haven't used it in years.)

>
> 3. Be very careful with "battery backed write cache."  It usually works
> as advertised.  More than once in the past decade I have seen
> spontaneous cache corruption after power losss.  The corruption usually
> happens when some admin, including me, has assumed that the cache will
> ALWAYS survive a power failure unblemished and has no "plan B."  Make
> sure you have a contingency plan for corruption, or don't enable the
cache.

I agree strongly. There is also the same problem with disk write back cache
and even with SCSI controllers with write through enabled. PITR would help
here. A lot of these problems are due to procedural error post crash.

>
> 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of
> the setup, and might hinder, not help small write I/O performance.

In a high volume system without write caching you are almost always going to
see queuing, which can make the larger buffer mostly irrelevant, if it's not
huge. Write caching thrives on big block sizes (which is a key reason why
Symmetrix doesn't do worse than it does) by reducing I/O counts. Most shops
I've set up or seen use mirroring or RAID 10 for logs. Note also that many
RAID 10 controllers in a non-write-cached setup allow a race between
the two writers, acknowledging when the first of the two completes -
increasing throughput by about 1/4.

>
> 5. Most (almost all) of the I/O time will be due to the access time
> (head seek + head settle + rotational latency) and very little of the
> I/O time will due to data transfer time.  In other words, getting drives
> that provide faster transfer rates will barely improve performance.  The
> secret is lowering the access time.

True. This is very much a latency story. Even in volume batch, you can see
access time that clearly shows some other system configuration bottleneck
that happens elsewhere before hitting I/O capacity.

>
> 6. A relatively cheap way to drastically drop the access time is to get
> large drive(s) and only use a portion of them for storage.  The less
> space used on the drive, the less area the heads need to cover for
> seeks.  At one extreme, you could make the partition the size of a
> single cylinder.  This would make access time (ignoring OS and
> controller overhead) identical to rotational latency, which is as low as
> 4.2 ms for a cheap 7200 RPM drive.

This is a good strategy for VLDB, and may not be relevant in this case.

Also - big sequential writes and 15K rpm drives, in the case of
writethrough, is a beautiful thing - they look like a manufacturers' demo. A
primary performance role of a RDBMS is to convert random I/O to sequential
(by buffering reads and using a streaming log to defer random writes to
checkpoints). RDBMS's are the prime beneficiaries of the drive speed
improvements - since logging, backups, and copies are about the only things
(ignoring bad perl scripts and find commands) that generate loads of 50+
MB/sec.

/Aaron

Re: Scaling further up

From
Andrew Sullivan
Date:
On Fri, Mar 12, 2004 at 06:25:48PM -0500, Anjan Dave wrote:
> We upgraded from 8GB to 12GB RAM a month or so ago, but even in the
> past, I've never seen the system exhaust on it's system cache (~6GB, in
> 'top'), while it's swapping.
>
> Some one had mentioned why not have the entire DB in memory? How do I
> configure that, for knowledge?

You don't.  It'll automatically be in memory if (a) you have enough
memory, (b) you don't have anything else on the machine using the
memory, and (c) it's been read at least one time.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca

Re: Scaling further up

From
Matt Davies
Date:
Quoting Andrew Sullivan <ajs@crankycanuck.ca>:

> On Fri, Mar 12, 2004 at 06:25:48PM -0500, Anjan Dave wrote:
> > We upgraded from 8GB to 12GB RAM a month or so ago, but even in the
> > past, I've never seen the system exhaust on it's system cache (~6GB, in
> > 'top'), while it's swapping.
> >
> > Some one had mentioned why not have the entire DB in memory? How do I
> > configure that, for knowledge?
>
> You don't.  It'll automatically be in memory if (a) you have enough
> memory, (b) you don't have anything else on the machine using the
> memory, and (c) it's been read at least one time.

This is the preferred method, but you could create a memory disk if running
Linux. This has several caveats, though.

1. You may have to recompile the kernel for support.
2. You must store the database on a hard drive partition during reboots.
3. Because of #2 this option is generally useful if you have static content that
is loaded to the MD upon startup of the system.

You could have some fancy methodology of shutting down the system and then
copying the data to a disk-based filesystem, but this is inherently bad since
at any moment a power outage would erase any updates or changes.

The option is there to start with all data in memory, but in general this is
probably not what you want. Just an FYI.
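
For completeness, a sketch of what the memory-disk route can look like with
tmpfs (the paths and size are assumptions; per caveats 2 and 3 above,
anything not copied back to real disk is lost on a crash or power failure):

  mount -t tmpfs -o size=14336m tmpfs /mnt/pgram
  cp -a /var/lib/pgsql/data /mnt/pgram/data   # seed the RAM copy from the on-disk cluster
  pg_ctl -D /mnt/pgram/data start
  # on shutdown: stop the server, then copy /mnt/pgram/data back to persistent storage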

Re: Scaling further up

From
Stephen Robert Norris
Date:
On Tue, 2004-03-16 at 07:28, Matt Davies wrote:
> This is the preferred method, but you could create a memory disk if running
> linux. This has several caveats, though.
>
> 1. You may have to recompile the kernel for support.
> 2. You must store the database on a hard drive partition during reboots.
> 3. Because of #2 this option is generally useful if you have static content that
> is loaded to the MD upon startup of the system.

And 4. You use twice as much memory - one lot for the FS, the second for
buffer cache.

It's generally going to be slower than simply doing some typical queries
to preload the data into buffer cache, I think.
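
A crude way to do that preload right after startup is to sequentially scan
the hot tables once so the kernel caches them; a sketch with placeholder
table names:

  psql -d mydb -c "SELECT count(*) FROM orders"
  psql -d mydb -c "SELECT count(*) FROM order_items"
  # repeat for the other frequently hit tables; indexes warm up on first use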

    Stephen
