Thread: choosing the right platform
Hello all,

I've been the lead developer of a successful (so far) web application. Currently we run the database and the web application on the same server, but it will soon be necessary to split them and add some more web servers. I'm not counting on using replication any time soon, so I want to choose the hardware platform that will meet my needs for some time. Maybe you can give some suggestions...

My application is written in PHP and relies on the Apache web server. We use persistent db connections, so it appears (correct me if I'm wrong) that every Apache child process gets one connection to the database. If my MaxClients is 150, each web server will be able to make up to 150 db connections. I'd like to play with that number a little bit once I get the web server off of the db server. I feel that I could handle a greater number of clients, so let us say that I have up to 200 connections per server.

I'd like to have room to grow, so let's also say that I go to 20 web servers for a total of 4,000 connections. (I'd probably like to go even higher, so consider this our starting point.)

With a couple dozen active accounts and a lot of test data, my current database is equivalent to about 100 active accounts. Its current disk space consumption is:

    data # du --max-depth=2
    3656    ./base/1
    3656    ./base/16975
    4292    ./base/95378
    177824  ./base/200371
    189432  ./base
    144     ./global
    82024   ./pg_xlog
    2192    ./pg_clog
    273836  .

This is not optimized and there is a lot of old data, but to be safe, maybe we should assume that each account uses 4 MB of disk space in the db, counting indexes, tables, etc. I'd like to scale to 15,000 - 25,000 accounts, but I doubt that will be feasible at my budget level. (Also, there is a lot of optimizing to do, so it won't surprise me if this 4 MB number is more like 2 MB or even less.)

I'm not as concerned with the disk subsystem or layout at the moment. I've seen a lot of good documentation (especially from Bruce Momjian, thanks!) on this subject. I'm mostly concerned with choosing the platform that's going to allow the scalability I need.

Currently I'm most experienced in Linux, especially Red Hat. I'm "certified" on SCO OpenServer (5.x) and I've played with Irix, OSF/1 (I don't think it's called that anymore), FreeBSD (3.x) and Solaris (2.x). I'm most comfortable with Linux, but I'm willing to use a different platform if it will be beneficial. I've heard that Solaris on the SPARC platform is capable of addressing larger amounts of RAM than Linux on Intel does. I don't know if that's true or if it has any bearing, but I'd like to hear your opinions.

My budget is going to be between (US) $5,000 and $10,000 and I'd like to stay under $7,000. I'm a major bargain hunter, so I shop eBay a lot, and here are some samplings that I think may be relevant for discussion:

SUN (I'm not an expert in this, advice is requested)
----------------------------------------------------
Sun Enterprise 4500 - 8 x 400 MHz 4MB cache CPUs, 8GB RAM, no hard drives, ~$6,000
Sun E3500 - 8 x 336 MHz 4MB cache CPUs, 4GB RAM, 8 x 9.1GB FC disks, ~$600.00
Any other suggestions?

INTEL (I'm much more familiar with this area)
----------------------------------------------------
Compaq DL580 - 4 x 700 MHz 2MB cache CPUs, 4GB RAM (16GB max), HW RAID w/ 64MB cache, ~$6,000
IBM Netfinity 7100 - 4 x 500 MHz 1MB cache CPUs, up to 16GB max, HW RAID
Dell PowerEdge 8450 - 8 x 550 MHz 2MB cache CPUs, 4GB RAM (32GB max), HS RAID w/ 16MB cache, ~$4,500
Any other suggestions?

Any other hardware platforms I should consider?
Finally, and I know this sounds silly, but I don't have my own data center, so size is something I need to take into consideration. I pay for data center space by the physical size of my servers. My priorities are performance, a reasonable amount of scalability (as outlined above), and finally physical size.

Thanks for taking the time to read this and for any assistance you can give,

Matthew Nuzum
www.bearfruit.org
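A rough back-of-envelope reading of the figures above (this is just the numbers from the post multiplied out, not a measurement of anything):

    20 web servers x 200 Apache children  = 4,000 potential persistent db connections
    25,000 accounts x 4 MB per account    = ~100 GB of tables and indexes
    25,000 accounts x 2 MB per account    = ~50 GB if the optimization pans out
    current cluster (du output above)     = ~270 MB total, of which ~80 MB is pg_xlog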
Matthew,

> Currently I'm most experienced in Linux, especially Red Hat. I'm
> "certified" on SCO OpenServer (5.x) and I've played with Irix, OSF/1 (I
> don't think it's called that anymore), FreeBSD (3.x) and Solaris (2.x).
> I'm most comfortable with Linux, but I'm willing to use a different
> platform if it will be beneficial. I've heard that Solaris on the SPARC
> platform is capable of addressing larger amounts of RAM than Linux on
> Intel does. I don't know if that's true or if it has any bearing, but
> I'd like to hear your opinions.

Please browse through the list archives. We have numerous posts on the platform subject. In fact, several of us are trying to put together a PostgreSQL performance test package to answer this question definitively rather than anecdotally.

Anecdotal responses are:

Solaris is *bad* for PostgreSQL, due to a high per-process overhead. The universal opinion on this list has been that Solaris is optimized for multi-threaded applications, not multi-process applications, and Postgres is the latter.

*BSD has a *lot* of fans on the PGSQL lists, many of whom claim significantly better performance than Linux, mostly due to better filesystem I/O.

Linux is used heavily by a lot of PostgreSQL users. I have yet to see anyone provide actual Linux vs. BSD statistics, though ... something we hope to do.

Nobody has come forward and reported on PostgreSQL on SCO Unix.

Irix is widely regarded as a "dead" platform, though PostgreSQL does run on it ...

Good luck, and keep watching this space!

--
Josh Berkus
Aglio Database Solutions
San Francisco
On Tue, 8 Apr 2003, Matthew Nuzum wrote:
> We use persistent db connections, so it appears (correct me if I'm
> wrong) that every apache child process gets one connection to the
> database. If my MaxClients is 150, each web server will be able to make
> up to 150 db connections. [...]

I would say up front that both Linux and BSD are probably your two best choices. If you're familiar with one more than the other, that familiarity may be more important than the underlying differences in the two OSes, as they are both good platforms to run postgresql on top of.

Secondly, look carefully at using persistent connections in large numbers. While persistent connections DO represent a big savings in connect time, the savings are lost in the noise of many PHP applications. I.e., my dual PIII swiss army knife server can initiate persistent connections at 1,000,000 a second (reusing the old ones, of course), while non-persistent connects happen at 1,000 times a second. Most of my scripts run in 1/10th of a second or so, so the 1/1000th of a second used to connect is noise to me.

If you are going to use persistent connections, it might work better to let Apache have only 20 or 40 children, which will force the Apache children to "round robin" serve the requests coming in. This will usually work fine, since keeping the number of Apache children down keeps the number of postgresql backends down, which keeps the system faster in terms of response time. Turn keep-alive down to something short like 10 seconds, or just turn it off, as keep-alive doesn't really save all that much time in Apache.

Note that machine testing with 100 simultaneous connections doesn't translate directly to 100 users. Generally, x simultaneous connections usually represents about 10 to 20x users, since users don't click buttons all that fast. So an Apache configured with 40 max children should handle 100 to 200 users with no problem.
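For concreteness, a minimal sketch of the Apache side of that advice (these are standard Apache 1.3/2.0 prefork directives; the values are only illustrative, not tuned for Matthew's load):

    # httpd.conf -- keep the child pool small so the number of persistent
    # PostgreSQL backends (one per child when using pg_pconnect) stays small
    MaxClients        40

    # shorten keep-alive so idle browsers don't pin children, or turn it off
    KeepAlive         On
    KeepAliveTimeout  10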
> Anecdotal responses are:
>
> Solaris is *bad* for PostgreSQL, due to a high per-process overhead.
> Universal opinion on this list has been that Solaris is optimized for
> multi-threaded applications, not multi-process applications, and
> postgres is the latter.
>
> *BSD has a *lot* of fans on the PGSQL lists, many of whom claim
> significantly better performance than Linux, mostly due to better
> filesystem I/O.
>
> Linux is used heavily by a lot of PostgreSQL users. I have yet to see
> anyone provide actual Linux vs. BSD statistics, though ... something we
> hope to do.

Thanks for the reply. Three things come to mind:

About the list archives... I read through the entire archive at http://archives.postgresql.org/pgsql-performance/ and didn't see much talk on the subject. It only goes back 8 months though, so I don't know if there is another archive that is more comprehensive...

Also, I'm glad to hear your comments about Solaris; I'm really most comfortable with Linux and I think I can pick up BSD pretty easily.

About the Intel platform, though: it's only been pretty recently (relatively speaking) that servers based on the IA32 architecture have had support for more than 2GB of RAM. I've heard talk about problems with applications that require more than 2GB. I do believe that my tables will become larger than this, and the way I understand it, sort_mem works best when the tables can be loaded completely in RAM. I don't suspect that individual tables will be 2GB, but that the size of all tables combined will be. If there is a limitation on the largest chunk of RAM allocated to a program, will I have problems?

Finally, can someone suggest a *BSD to evaluate? FreeBSD 4.8? 5.0? Is Apple a good choice? (I've heard it's based on BSD Unix.)

Thanks,

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org
> Finally, can someone suggest a *BSD to evaluate? FreeBSD 4.8? 5.0? Is
> Apple a good choice? (I've heard it's based on BSD Unix)

I wouldn't recommend OS X for deployment if you're worried about performance. The hardware available and the settings to take advantage of it just aren't there yet, compared to the more established FreeBSD and Linux offerings.

Development on a TiBook is another matter; I'd recommend it to anyone with an attraction to shiny things that do real work.

eric
On Wed, 9 Apr 2003, Matthew Nuzum wrote:
> I'm glad to hear your comments about Solaris, I'm really most
> comfortable with Linux and I think I can pick up BSD pretty easily.
>
> About the Intel platform though,
>
> It's only been pretty recently (relatively speaking) that servers based
> on IA32 architecture have had support for greater than 2GB of RAM.
> [...]
> I don't suspect that individual tables will be 2GB, but that the size
> of all tables combined will be. If there is a limitation on the largest
> chunk of RAM allocated to a program, will I have problems?

A couple more suggestions. One is to never allocate more than 50% of your memory to a database's shared buffers, i.e. let the OS buffer the disks en masse, while the database should have a smaller buffer for the most recent accesses. This is because kernel caching is usually faster and more efficient than the database doing it, and this becomes more of an issue with large chunks of memory, which both Linux and BSD are quite good at caching, and postgresql, not so good.

The other is to look at Linux or BSD on 64-bit hardware (SPARC, IBM zSeries mainframes, SGI Altix, etc.), where the one thing that's worth being on the bleeding edge for is databases and their memory-hungry ways. :-)
Matthew,

> I read through the entire archive at
> http://archives.postgresql.org/pgsql-performance/ and didn't see much
> talk on the subject. It only goes back 8 months though, so I don't know
> if there is another archive that is more comprehensive...

Really? There was a long-running Mac OS X vs. Solaris thread that touched on most major platforms, about 2-3 months ago.

> I don't suspect that individual tables will be 2GB, but that the size
> of all tables combined will be. If there is a limitation on the largest
> chunk of RAM allocated to a program, will I have problems?

No. Since PostgreSQL is a multi-process architecture, not multi-threaded, you only need enough RAM per process to load the current largest query. Plus, in my experience, disk I/O issues are vastly more important than RAM in database performance. You're better off spending money on really fast disks in Linux RAID or really good hardware RAID 1+0 ....

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
On Wed, Apr 09, 2003 at 01:13:46PM -0400, Matthew Nuzum wrote:
> Finally, can someone suggest a *BSD to evaluate? FreeBSD 4.8? 5.0? Is
> Apple a good choice? (I've heard it's based on BSD Unix)

FreeBSD has 3 different branches:

-current: This is bleeding edge. You definitely need to be careful with this one, and it's not recommended for production.

-stable: This is still a 'live' branch that any FreeBSD developer can (generally) commit to, but they are far more careful about breaking this branch. It's not as stable as a release branch, but it's probably suitable for production so long as you're careful to test things.

release branches: Every time an official release is done (i.e. 4.8), a branch is created. The only code committed to these branches is security patches and fixes for very serious bugs. These branches are extremely stable.

5.0 is the first release after several years of development in -current. It incorporates some major changes designed to allow the kernel to run multi-threaded. However, unlike what usually happens, 5.0 is not considered to be -stable yet. First, this is still very new code; second, I believe there are some performance issues that are still being addressed. The intention is that 5.1 will be the first -stable release of the 5.x code.

Because you're looking for something that's production ready, you probably want 4.8 (cvs tag RELENG_4_8). However, if you don't plan to hit production until late this year (when 5.1 should be out), you might want to try 5.0.

Far more info is available at http://www.freebsd.org/releng/index.html

BTW, I've heard of many, many companies moving their Oracle installs from Sun to RS/6000 because RS/6000s typically need half the processors that Sun does for a given load. If you're going to look at big iron, RS/6000 is definitely worth a look if you see anything.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
On Wed, Apr 09, 2003 at 11:55:56AM -0600, scott.marlowe wrote:
> A couple more suggestions. One is to never allocate more than 50% of
> your memory to a database's shared buffers, i.e. let the OS buffer the
> disks en masse, while the database should have a smaller buffer for the
> most recent accesses. This is because kernel caching is usually faster
> and more efficient than the database doing it, and this becomes more of
> an issue with large chunks of memory, which both Linux and BSD are
> quite good at caching, and postgresql, not so good.

That seems odd... shouldn't pgsql be able to cache information better, since it would be cached in whatever format is best for it, rather than the raw page format (or maybe that is the best format)? There's also the issue of having to go through more layers of software if you're relying on the OS caching.

All the tuning info I've seen for every other database I've worked with specifically recommends giving the database as much memory as you possibly can, the theory being that it will do a much better job of caching than the OS will.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
On Wed, Apr 09, 2003 at 10:51:34AM -0600, scott.marlowe wrote:
> Secondly, look carefully at using persistent connections in large
> numbers.
>
> While persistent connections DO represent a big savings in connect
> time, the savings are lost in the noise of many PHP applications.
>
> i.e. my dual PIII swiss army knife server can initiate persistent
> connections at 1,000,000 a second (reusing the old ones of course).
> non-persistent connects happen at 1,000 times a second. Most of my
> scripts run in 1/10th of a second or so, so the 1/1000th used to
> connect is noise to me.

My $0.02 from my experience with Sybase and DB2: it's not the connection *time* that's an issue, it's the amount of resources (mostly memory) used by each database connection. Each DB2 connection to a database uses 4-8 megs of memory; on my pgsql system, each connection appears to be using about 4M. This is the resident set, which I believe indicates memory that basically can't be shared. All this memory is memory that can't be used for buffering/caching; on a system with a hundred connections, it can really start to add up.

If your PHP is written in such a way that it does all the database work in one section of code, and only holds a connection to the database in that one section, then you can potentially have a lot of Apache processes for each database connection. Of course, all this holds true whether you're using pooling or not. How much pooling will help depends on how expensive it is for the *database* to handle each new connection request, and how your code is written.

Since it's often not possible to put all your database code in one place like I mentioned above, an alternative is to connect right before you do an operation, and disconnect as soon as you're done. This doesn't add much (if any) expense if you're using pooling, but it's a very different story if you're not using pooling.

> If you are going to use persistent connections, it might work better to
> let apache have only 20 or 40 children, which will force the apache
> children to "round robin" serve the requests coming in.
>
> This will usually work fine, since keeping the number of apache
> children down keeps the number of postgresql backends down, which keeps
> the system faster in terms of response time. Turn keep alive down to
> something short like 10 seconds, or just turn it off, as keep alive
> doesn't really save all that much time in apache.

Very important advice. Generally, once you push a database past a certain point, your performance degrades severely as the database thrashes about trying to answer all the pending queries.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net
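A minimal sketch of the "connect only around the work" pattern described above, using PHP 4-era pg_* calls (the connection string, table name, and helper function are made up for illustration):

    <?php
    // Hypothetical helper: hold a backend only for the section of the
    // page that actually needs the database.
    function fetch_account_count()
    {
        // pg_connect() starts a fresh backend each time; swapping in
        // pg_pconnect() would instead reuse one connection per Apache
        // child (the persistent behavior discussed in this thread).
        $conn = pg_connect("host=dbhost dbname=myapp user=webuser");
        if (!$conn) {
            return false;
        }

        $result = pg_query($conn, "SELECT count(*) FROM accounts");
        $row = pg_fetch_row($result);

        // Release the backend before the rest of the page (templating,
        // image handling, etc.) runs.
        pg_close($conn);

        return $row[0];
    }
    ?>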
"Jim C. Nasby" <jim@nasby.net> writes: > That seems odd... shouldn't pgsql be able to cache information better > since it would be cached in whatever format is best for it, rather than > the raw page format (or maybe that is the best format). There's also the > issue of having to go through more layers of software if you're relying > on the OS caching. All the tuning info I've seen for every other > database I've worked with specifically recommends giving the database as > much memory as you possibly can, the theory being that it will do a much > better job of caching than the OS will. There are a number of reasons why that's a dubious policy for PG (I won't take a position on whether these apply to other databases...) One is that because we sit on top of the OS' filesystem, we can't (portably) prevent the OS from caching blocks. So it's quite easy to get into a situation where the same data is cached twice, once in PG buffers and once in kernel disk cache. That's clearly a waste of RAM however you slice it, and it's worst when you set the PG shared buffer size to be about half of available RAM. You can minimize the duplication by skewing the allocation one way or the other: either set PG's allocation relatively small, relying heavily on the OS to do the caching; or make PG's allocation most of RAM and hope to squeeze out the OS' cache. There are partisans for both approaches on this list. I lean towards the first policy because I think that starving the kernel for RAM is a bad idea. (Especially if you run on Linux, where this policy tempts the kernel to start kill -9'ing random processes ...) Another reason is that PG uses a simplistic fixed-number-of-buffers internal cache, and therefore it can't adapt on-the-fly to varying memory pressure, whereas the kernel can and will give up disk cache space to make room when it's needed for processes. Since PG isn't even aware of the total memory pressure on the system as a whole, it couldn't do as good a job of trading off cache vs process workspace as the kernel can do, even if we had a variable-size cache scheme. A third reason is that on many (most?) Unixen, SysV shared memory is subject to swapping, and the bigger you make the shared_buffer arena, the more likely it gets that some of the arena will be touched seldom enough to make it a candidate for swapping. A disk buffer that gets swapped to disk is worse than useless (if it's dirty, the swapping is downright counterproductive, since an extra read and write cycle will be needed before the data can make it to its rightful place). PG is *not* any smarter about the usage patterns of its disk buffers than the kernel is; it uses a simple LRU algorithm that is surely no brighter than what the kernel uses. (We have looked at smarter buffer recycling rules, but failed to see any performance improvement.) So the notion that PG can do a better job of cache management than the kernel is really illusory. About the only advantage you gain from having data directly in PG buffers rather than kernel buffers is saving the CPU effort needed to move data across the userspace boundary --- which is not zero, but it's sure a lot less than the time spent for actual I/O. So my take on it is that you want shared_buffers fairly small, and let the kernel do the bulk of the heavy lifting for disk cache. That's what it does for a living, so let it do what it does best. You only want shared_buffers big enough so you don't spend too many CPU cycles shoving data back and forth between PG buffers and kernel disk cache. 
The default shared_buffers setting of 64 is surely too small :-(, but my feeling is that values in the low thousands are enough to get past the knee of that curve in most cases. regards, tom lane
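To make that concrete, a postgresql.conf sketch along the lines Tom describes, assuming a hypothetical machine with a few GB of RAM (the exact values are illustrative; units are those of the 7.3-era configuration):

    shared_buffers = 4096          # 8 KB pages each, i.e. ~32 MB (the "low
                                   # thousands" mentioned above)
    sort_mem = 8192                # KB per sort, per backend; multiply by the
                                   # expected concurrent sorts before raising it
    effective_cache_size = 262144  # 8 KB pages (~2 GB): a planner hint saying
                                   # how much the kernel is expected to cache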
Thanks for all the feedback, this is very informative.

The current issues that I'm still not clear on are:

* Is the IA32 architecture going to impose uncomfortable limits on my application? I'm seeing lots of confirmation that this platform, regardless of the OS, is going to limit me to less than 4GB of memory allocated to a single application (e.g. http://www.spack.org/index.cgi/LinuxRamLimits). This may or may not be an issue because: (note that these are questions, not statements)
** Postgres is multi-process, not multi-threaded (?)
** It's better not to use a huge amount of sort_mem but instead let the OS do the caching (?)
** My needs are really not going to be as big as I think they are if I manage the application/environment correctly (?)

Here are some of the performance suggestions I've heard; please, if I misunderstood, could you help me get clarity?

* It's better to run fewer Apache children and turn off persistent connections (I had suggested 200 children per server, someone else suggested 40).
* FreeBSD is going to provide a better filesystem than Linux (because Linux only supports large files on journaling filesystems, which impose extra overhead). (This was gleaned from this conversation and previous threads in the archives.)
* Running Linux or *BSD on a 64-bit platform can alleviate some potential RAM limitations (if there are truly going to be limitations). If this is so, I've heard suggestions for Itanium, SPARC and RS/6000. Maybe someone can give some more info on these; here are my immediate thoughts: I've heard that the industry as a whole has not yet warmed up to Itanium. I can't afford the newest SPARC servers, so I'd need to settle for a previous generation if I went that route; any problems with that? I know nothing about the RS/6000 servers (I did see one once though :-); does Linux or *BSD run well on them, and any suggestions for series/models I should look at?

Finally, some specific questions: What's the max number of connections someone has seen on a database server? What type of hardware was it? How much RAM did Postgres use?

Thanks again,

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org
Tom,

What happens when PG scans a table that is too big to fit in the cache? Won't the whole cache get trashed and swapped off to disk? Shouldn't there be a way to lock some tables in PG's cache? How about characterizing some of the RAM for different uses, like: scans, indexes, small frequently used tables?

JLL

Tom Lane wrote:
> [...]
> PG is *not* any smarter about the usage patterns of its disk buffers
> than the kernel is; it uses a simple LRU algorithm that is surely no
> brighter than what the kernel uses. (We have looked at smarter buffer
> recycling rules, but failed to see any performance improvement.)
> [...]
Jean-Luc Lachance <jllachan@nsd.ca> writes:
> Shouldn't there be a way to lock some tables in PG cache?

In my opinion, no. I do not think a manual locking feature could possibly be used effectively. It could very easily be abused to decrease net performance, though :-(

It does seem that a smarter buffer management algorithm would be a good idea, but past experiments have failed to show any measurable benefit. Perhaps those experiments were testing the wrong conditions. I'd still be happy to see LRU(k) or some such method put in, if someone can prove that it actually does anything useful for us. (As best I recall, I only tested LRU-2 with pgbench. Perhaps Josh's benchmarking project will offer a wider variety of interesting scenarios.)

			regards, tom lane
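For anyone who wants to try that sort of measurement themselves, pgbench (shipped in PostgreSQL's contrib/ directory) can be run roughly like this; the database name and the scale/client numbers are arbitrary examples:

    # build a test database at scale factor 10 (about 1 million "accounts" rows)
    createdb cache_test
    pgbench -i -s 10 cache_test

    # 10 concurrent clients, 1000 transactions each; reports transactions/sec
    pgbench -c 10 -t 1000 cache_test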
On Wed, 9 Apr 2003, Jim C. Nasby wrote:
> My $0.02 from my experience with Sybase and DB2:
> It's not the connection *time* that's an issue, it's the amount of
> resources (mostly memory) used by each database connection. Each db2
> connection to a database uses 4-8 meg of memory;

Agreed.

> on my pgsql system, each connection appears to be using about 4M. This
> is the resident set, which I believe indicates memory that basically
> can't be shared. All this memory is memory that can't be used for
> buffering/caching; on a system with a hundred connections, it can
> really start to add up.

If I run "select * from logs" from two different psql sessions on my backup box hitting my main box (psql would hold the result set and throw the results off if I ran it on the main box), I get this output from top:

No (pgsql) load:

  8:58am  up 9 days, 22:43,  4 users,  load average: 0.65, 0.54, 0.35
 169 processes: 168 sleeping, 1 running, 0 zombie, 0 stopped
 CPU0 states:  0.1% user,  0.1% system,  0.0% nice,  99.1% idle
 CPU1 states: 32.1% user,  3.2% system,  0.0% nice,  64.0% idle
 Mem:  1543980K av, 1049864K used,  494116K free, 265928K shrd, 31404K buff
 Swap: 2048208K av,       0K used, 2048208K free,             568600K cached

   PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 10241 postgres   9  0  4216 4216  4136 S     0.0  0.2  0:05 postmaster
 10242 postgres   9  0  4444 4444  4156 S     0.0  0.2  0:00 postmaster
 10243 postgres   9  0  4812 4812  4148 S     0.0  0.3  0:00 postmaster

1 psql select *:

  9:03am  up 9 days, 22:48,  2 users,  load average: 0.71, 0.71, 0.46
 166 processes: 165 sleeping, 1 running, 0 zombie, 0 stopped
 CPU0 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
 CPU1 states:  0.1% user,  2.0% system,  0.0% nice,  97.3% idle
 Mem:  1543980K av, 1052188K used,  491792K free, 265928K shrd, 32036K buff
 Swap: 2048208K av,       0K used, 2048208K free,             570656K cached

   PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 10241 postgres  10  0  4216 4216  4136 S     0.0  0.2  0:05 postmaster
 10242 postgres   9  0  4448 4448  4156 S     0.0  0.2  0:00 postmaster
 10243 postgres   9  0  4812 4812  4148 S     0.0  0.3  0:00 postmaster
 18026 postgres   9  0  236M 236M  235M S     0.0 15.6  0:12 postmaster
 18035 postgres  10  0  5832 5732  5096 S     0.0  0.3  0:00 postmaster

2 psql select *:

  9:03am  up 9 days, 22:49,  2 users,  load average: 0.58, 0.66, 0.45
 166 processes: 165 sleeping, 1 running, 0 zombie, 0 stopped
 CPU0 states:  0.0% user,  2.2% system,  0.0% nice,  97.2% idle
 CPU1 states:  0.0% user,  0.0% system,  0.0% nice, 100.0% idle
 Mem:  1543980K av, 1053152K used,  490828K free, 265928K shrd, 32112K buff
 Swap: 2048208K av,       0K used, 2048208K free,             570684K cached

   PID USER     PRI NI  SIZE  RSS SHARE STAT %CPU %MEM  TIME COMMAND
 10241 postgres   8  0  4216 4216  4136 S     0.0  0.2  0:05 postmaster
 10242 postgres   9  0  4448 4448  4156 S     0.0  0.2  0:00 postmaster
 10243 postgres   9  0  4812 4812  4148 S     0.0  0.3  0:00 postmaster
 18026 postgres   9  0  236M 236M  235M S     0.0 15.6  0:12 postmaster
 18035 postgres   9  0  236M 236M  235M S     0.0 15.6  0:12 postmaster

The difference between SIZE and SHARE is the delta, which is only something like 3 or 4 megs for the initial "select * from logs", but for the second one it is only 1 meg. On average, the actual increase in memory usage for postgresql isn't that great, usually about 1 meg.

Running out of memory isn't really a problem with connections <= 200 and 1 gig of RAM, as long as sort_mem isn't too high. I/O contention is the killer at that point, as is CPU load.
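If you want to eyeball the same numbers on your own box, the figures above come straight from a stock procps top; something like the following, run on the database host, captures one snapshot of every backend, and the SIZE minus SHARE difference is roughly the private per-connection cost:

    top -b -n 1 | grep postmaster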
On Wed, 9 Apr 2003, Jim C. Nasby wrote:
> That seems odd... shouldn't pgsql be able to cache information better
> since it would be cached in whatever format is best for it, rather than
> the raw page format (or maybe that is the best format).

Yes and no. The problem isn't that the data is closer to postgresql in its buffers versus further away in kernel buffers, it's that postgresql's caching algorithm isn't performance-tweaked for very large settings; it's performance-tweaked to provide good performance on smaller machines, with say 4 or 16 megs of shared buffers. Handling large buffers requires a different approach than handling small ones, and the kernel is optimized in that direction.

Also, the kernel in most OSes, i.e. Linux and BSD, tends to use "spare RAM" with abandon as cache memory, so if you've got 4 gigs of RAM, with 200 megs set aside for postgresql, it's quite likely that the kernel cache can hold ALL your dataset for you once it's been read in once. So the data is already cached once. Caching it again in postgresql only gains a little, since the speed difference between postgresql's shared buffer cache and the kernel cache is very small. However, the speed going to the hard drive is much slower.

What you don't want is a postgresql cache that's bigger (on average) than the kernel cache, since the kernel cache will then be "thrashing" when you access information not currently in either cache. I.e. postgresql becomes your only cache, and kernel caching stops working for you and becomes just overhead, since you never get anything from it if it's too small to hold something long enough to be used again.

> There's also the issue of having to go through more layers of software
> if you're relying on the OS caching. All the tuning info I've seen for
> every other database I've worked with specifically recommends giving
> the database as much memory as you possibly can, the theory being that
> it will do a much better job of caching than the OS will.

That's old-school thinking. There was a day when kernel caching was much slower, and writing directly to your devices in raw mode was the only way to ensure good performance. Nowadays, most modern Unix kernels and their file systems are a match for most database needs. Heck, with some storage systems, the performance of the file system is just not really an issue; it's the bandwidth of the connector you use.

Note that this is a good thing (TM), since it frees the postgresql development team to do other things than worry about caching 1 gig of data.
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Note that this is a good thing (TM) since it frees the postgresql > development team to do other things than worry about caching 1 gig of > data. Yeah. I think this is one very fundamental difference of design philosophy between Postgres and more-traditional databases such as Oracle. We prefer to let the kernel and filesystem do their jobs, and we assume they will do them well; whereas Oracle wants to bypass if not replace the kernel and filesystem. Partly this is a matter of the PG project not having the manpower to replace those layers. But I believe the world has changed in the last twenty years, and the Oracle approach is now obsolete: it's now costing them design and maintenance effort that isn't providing much return. Modern kernels and filesystems are *good*, and it's not easy to do better. We should focus our efforts on functionality that doesn't just duplicate what the OS can do. This design approach shows up in other areas too. For instance, in another thread I was just pointing out that there is no need for our frontend/backend protocol to solve communication problems like dropped or duplicated packets; TCP does that perfectly well already. regards, tom lane
How can we solve the problem of cache trashing when scanning large tables?

Tom Lane wrote:
> It does seem that a smarter buffer management algorithm would be a good
> idea, but past experiments have failed to show any measurable benefit.
> Perhaps those experiments were testing the wrong conditions. I'd still
> be happy to see LRU(k) or some such method put in, if someone can prove
> that it actually does anything useful for us.
On Thu, Apr 10, 2003 at 10:42:35AM -0600, scott.marlowe wrote:
> The difference between SIZE and SHARE is the delta, which is only
> something like 3 or 4 megs for the initial select * from logs, but for
> the second one it is only 1 meg. On average, the actual increase in
> memory usage for postgresql isn't that great, usually about 1 meg.
>
> Running out of memory isn't really a problem with connections <= 200
> and 1 gig of RAM, as long as sort_mem isn't too high. I/O contention
> is the killer at that point, as is CPU load.

Except you should consider what you could be doing with that 200M, i.e. caching data. Even something as small as 1M per connection starts to add up.

--
Jim C. Nasby (aka Decibel!)    jim@nasby.net