Caching (was Re: choosing the right platform)

From: "Matthew Nuzum"

Thanks for all the feedback, this is very informative.

The current issues I'm still not clear on are:
* Is the ia32 architecture going to impose uncomfortable limits on my
application?  I'm seeing lots of confirmation that this platform, regardless
of the OS, is going to limit me to less than 4GB of memory allocated to a
single application (i.e. http://www.spack.org/index.cgi/LinuxRamLimits).
This may or may not be an issue because: (note that these are questions, not
statements)
** Postgres is multi-process, not multi-threaded (?)
** It's better not to use a huge amount of sort-mem but instead let the OS do
the caching (?)
** My needs are really not going to be as big as I think they are if I
manage the application/environment correctly (?)

Here are some of the performance suggestions I've heard; please, if I
misunderstood, could you help me get clarity?
* It's better to run fewer apache children and turn off persistent
connections (I had suggested 200 children per server, someone else suggested
40)
* FreeBSD is going to provide a better file system than Linux (because Linux
only supports large files on journaling filesystems, which impose extra
overhead) (this was gleaned from this conversation and previous threads in
the archives)
* Running Linux or *BSD on a 64 bit platform can alleviate some potential
RAM limitations (if there are truly going to be limitations).  If this is
so, I've heard suggestions for Itanium, Sparc and RS/6000.  Maybe someone
can give some more info on these, here are my immediate thoughts: I've heard
that the industry as a whole has not yet warmed up to Itanium.  I can't
afford the newest Sparc servers, so I'd need to settle for a previous
generation if I went that route; any problems with that?  I know nothing
about the RS/6000 servers (I did see one once though :-); does Linux|*BSD
run well on them, and any suggestions for series/models I should look at?

Finally, some specific questions:
What's the max number of connections someone has seen on a database server?
What type of hardware was it?  How much RAM did postgres use?

Thanks again,

--
Matthew Nuzum
www.bearfruit.org
cobalt@bearfruit.org

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Wednesday, April 09, 2003 8:21 PM
> To: jim@nasby.net
> Cc: scott.marlowe; Matthew Nuzum; 'Josh Berkus'; 'Pgsql-Performance'
> Subject: Caching (was Re: [PERFORM] choosing the right platform)
>
> "Jim C. Nasby" <jim@nasby.net> writes:
> > That seems odd... shouldn't pgsql be able to cache information better
> > since it would be cached in whatever format is best for it, rather than
> > the raw page format (or maybe that is the best format). There's also the
> > issue of having to go through more layers of software if you're relying
> > on the OS caching. All the tuning info I've seen for every other
> > database I've worked with specifically recommends giving the database as
> > much memory as you possibly can, the theory being that it will do a much
> > better job of caching than the OS will.
>
> There are a number of reasons why that's a dubious policy for PG (I
> won't take a position on whether these apply to other databases...)
>
> One is that because we sit on top of the OS' filesystem, we can't
> (portably) prevent the OS from caching blocks.  So it's quite easy to
> get into a situation where the same data is cached twice, once in PG
> buffers and once in kernel disk cache.  That's clearly a waste of RAM
> however you slice it, and it's worst when you set the PG shared buffer
> size to be about half of available RAM.  You can minimize the
> duplication by skewing the allocation one way or the other: either set
> PG's allocation relatively small, relying heavily on the OS to do the
> caching; or make PG's allocation most of RAM and hope to squeeze out
> the OS' cache.  There are partisans for both approaches on this list.
> I lean towards the first policy because I think that starving the kernel
> for RAM is a bad idea.  (Especially if you run on Linux, where this
> policy tempts the kernel to start kill -9'ing random processes ...)
>
> Another reason is that PG uses a simplistic fixed-number-of-buffers
> internal cache, and therefore it can't adapt on-the-fly to varying
> memory pressure, whereas the kernel can and will give up disk cache
> space to make room when it's needed for processes.  Since PG isn't
> even aware of the total memory pressure on the system as a whole,
> it couldn't do as good a job of trading off cache vs process workspace
> as the kernel can do, even if we had a variable-size cache scheme.
>
> A third reason is that on many (most?) Unixen, SysV shared memory is
> subject to swapping, and the bigger you make the shared_buffer arena,
> the more likely it gets that some of the arena will be touched seldom
> enough to make it a candidate for swapping.  A disk buffer that gets
> swapped to disk is worse than useless (if it's dirty, the swapping
> is downright counterproductive, since an extra read and write cycle
> will be needed before the data can make it to its rightful place).
>
> PG is *not* any smarter about the usage patterns of its disk buffers
> than the kernel is; it uses a simple LRU algorithm that is surely no
> brighter than what the kernel uses.  (We have looked at smarter buffer
> recycling rules, but failed to see any performance improvement.)  So the
> notion that PG can do a better job of cache management than the kernel
> is really illusory.  About the only advantage you gain from having data
> directly in PG buffers rather than kernel buffers is saving the CPU
> effort needed to move data across the userspace boundary --- which is
> not zero, but it's sure a lot less than the time spent for actual I/O.
>
> So my take on it is that you want shared_buffers fairly small, and let
> the kernel do the bulk of the heavy lifting for disk cache.  That's what
> it does for a living, so let it do what it does best.  You only want
> shared_buffers big enough so you don't spend too many CPU cycles shoving
> data back and forth between PG buffers and kernel disk cache.  The
> default shared_buffers setting of 64 is surely too small :-(, but my
> feeling is that values in the low thousands are enough to get past the
> knee of that curve in most cases.
>
>             regards, tom lane
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org


Re: Caching (was Re: choosing the right platform)

From: Josh Berkus

Matthew,

> ** Postgres is multi-process, not multi-threaded (?)

Correct.

> ** It's better not to use a huge amount of sort-mem but instead let the OS do
> the caching (?)

That's "don't use a huge amount of *shared_buffers*".  Sort_mem is a different
setting.  However, I have never seen a database use more than 32mb sort mem
in a single process, so I don't think the 2GB limit will hurt you much ...
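
To make the distinction concrete, here's a minimal postgresql.conf sketch
(7.3-era parameter names; the values are purely illustrative, not
recommendations):

  # shared_buffers is the shared disk-buffer arena, counted in 8KB pages;
  # sort_mem is per-sort working memory, counted in KB
  shared_buffers = 2048        # ~16MB; the "low thousands" Tom suggests
  sort_mem = 8192              # ~8MB per sort; can be raised per session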

> ** My needs are really not going to be as big as I think they are if I
> manage the application/environment correctly (?)

Your needs *per process*.   Also, PostgreSQL is not as much of a consumer of
RAM as it is a consumer of disk I/O.

> * FreeBSD is going to provide a better file system than Linux (because
> Linux only supports large files on journaling filesystems, which impose
> extra overhead) (this was gleaned from this conversation and previous
> threads in the archives)

No, the jury is still out on this one.   ReiserFS is optimized for small
files, and I've done well with it, although some posters report stability
problems (all second-hand).   We hope to test this sometime in the
upcoming months.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Caching (was Re: choosing the right platform)

From: Ron Johnson

On Wed, 2003-04-09 at 20:16, Matthew Nuzum wrote:
> Thanks for all the feedback, this is very informative.
[snip]
> * Running Linux or *BSD on a 64 bit platform can alleviate some potential
> RAM limitations (if there are truly going to be limitations).  If this is
> so, I've heard suggestions for Itanium, Sparc and RS/6000.  Maybe someone
> can give some more info on these, here are my immediate thoughts: I've heard
> that the industry as a whole has not yet warmed up to Itanium.  I can't
> afford the newest Sparc servers, so I'd need to settle for a previous
> generation if I went that route; any problems with that?  I know nothing
> about the RS/6000 servers (I did see one once though :-); does Linux|*BSD
> run well on them, and any suggestions for series/models I should look at?

If you want 64-bit, maybe wait for Opteron, or look at Alphas.  You could
probably get a used DS20 or ES40 for a pretty good price, and Linux is
*well* supported on Alpha.  If you want something that really smokes,
and have some bucks lying around, try an ES47.

--
+----------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net           |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson   |
|                                                                |
| "A C program is like a fast dance on a newly waxed dance floor |
|  by people carrying razors."                                   |
|      Waldi Ravens                                              |
+----------------------------------------------------------------+


Re: Caching (was Re: choosing the right platform)

From: David McKain

On Wed, Apr 09, 2003 at 09:16:36PM -0400, Matthew Nuzum wrote:
> Thanks for all the feedback, this is very informative.
>
> Here are some of the performance suggestions I've heard; please, if I
> misunderstood, could you help me get clarity?
> * It's better to run fewer apache children and turn off persistent
> connections (I had suggested 200 children per server, someone else suggested
> 40)

Hi Matthew,

I'm coming in a bit late and slightly OT here, but one common Apache
solution you might want to look at is a "reverse proxy" configuration.
This works very well if there's a good proportion of static vs dynamic
content on your site - if your pages contain a lot of graphics then this
may well be the case.

To do this, you compile 2 Apache servers listening on different ports on
the same machine (or you can have them on different machines too).

Server 1 (we'll call it the "front server") is just a vanilla Apache
listening on Port 80, compiled with mod_rewrite and mod_proxy but
nothing else.

Server 2 ("back server" or "heavy server") has mod_php and anything else
you need which is quite bulky (e.g. XML processing stuff, mod_perl ...)
It can listen on Port 8080 or something. Your persistent DB connections
come from Server 2.

All web requests come in to Server 1 in the normal way and Server 1
deals with static content as before. By setting up Apache rewrite rules
on Server 1, requests for *.php and other dynamic stuff can be forwarded
to Server 2 for processing. Server 2 returns its response back through
Server 1 and the end-user is oblivious to what's going on. (Server 2
and/or your firewall can be configured to allow connections only from
Server 1 too.)
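
In case it helps, a rough httpd.conf sketch of the front-server half (this
assumes Apache built with mod_rewrite and mod_proxy; the back-end address,
port and URL pattern are illustrative):

  # Server 1 (front server): pass dynamic requests through to Server 2
  RewriteEngine On
  # the [P] flag proxies the request rather than redirecting the client
  RewriteRule ^/(.+\.php)$ http://127.0.0.1:8080/$1 [P,L]
  # fix up redirects issued by Server 2 so clients only ever see Server 1
  ProxyPassReverse / http://127.0.0.1:8080/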

It's a bit of effort to set up and does require a wee bit more
maintenance than a single server but it comes with a few nice
advantages:

* You can have a lower MaxClients setting on Server 2 and hence fewer
  persistent DB connections and less memory used by heavy Apache modules
  and PostgreSQL backends.

* Server 1 is nice and light - no DB, low memory use (much of which is
  probably shared) - so you can set its MaxClients much higher.

* The overall impact of each dynamic page is lower, as all of the
  images and stylesheets it references can be dealt with quickly by
  Server 1 rather than each one tying up a wodge of memory and a
  persistent DB connection.

I used this recently for transforming XML web pages into HTML using XSLT
and mod_perl on a slightly old and underpowered Solaris server and it
worked really well. Of course, YMMV!

There are lots of tutorials on setting this up on the web - the mod_perl
guide has some very handy stuff in it which ought to apply reasonably
well to PHP too:

http://perl.apache.org/docs/1.0/guide/scenario.html

Hope that might help,
David.


Re: Caching (was Re: choosing the right platform)

From: Shridhar Daithankar

On Thursday 10 April 2003 15:04, you wrote:
> On Wed, Apr 09, 2003 at 09:16:36PM -0400, Matthew Nuzum wrote:
> > Thanks for all the feedback, this is very informative.
> >
> > Here are some of the performance suggestions I've heard; please, if I
> > misunderstood, could you help me get clarity?
> > * It's better to run fewer apache children and turn off persistent
> > connections (I had suggested 200 children per server, someone else
> > suggested 40)
>
> Hi Matthew,
>
> I'm coming in a bit late and slightly OT here, but one common Apache
> solution you might want to look at is a "reverse proxy" configuration.
> This works very well if there's a good proportion of static vs dynamic
> content on your site - if your pages contain a lot of graphics then this
> may well be the case.
>
> To do this, you compile 2 Apache servers listening on different ports on

Umm..  AFAIK, if you use FastCGI, connection persistence should be a lot
better, <self drumming on> as it should with OAS Server, which gives you
explicit control over how many resources to allocate. </self drumming on>

 Shridhar


Re: Caching (was Re: choosing the right platform)

From: "Ron Mayer"

Short summary...

  I think sort_mem matters quite a bit (20-40%) on
  my data-warehousing applications.

  Am I doing something wrong to need so much sort_mem?

Josh wrote:
>> ** It's better not to use a huge amount of sort-mem...
>
>...However, I have never seen a database use more than 32MB of sort mem
>in a single process, so I don't think the 2GB limit will hurt you much ...

Do you think this is true in data warehousing applications as well?

During the ETL part of data warehousing, large sorts are often
used to get the "new" values that need to be inserted
into "dimension" tables, like this:
  INSERT INTO dimension_val (id,val)
       SELECT nextval('val_seq'),val
         FROM (SELECT DISTINCT val FROM import_table
               EXCEPT
               SELECT          val FROM dimension_val) as a;
As far as I can tell, this query typically does two sorts,
one for the distinct, and one for the except.
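
A plain EXPLAIN over the statement makes the two sorts visible (just a
sketch; the exact plan shape will vary with version and statistics):

  EXPLAIN
  SELECT DISTINCT val FROM import_table
  EXCEPT
  SELECT          val FROM dimension_val;
  -- expect one Sort feeding the Unique node (for the DISTINCT)
  -- and another Sort feeding the SetOp Except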


In a data warehouse we have here, we load about 3 million rows
each week; load time improved from about 9 to 7 hours
by breaking up such queries into expressions that only require
one sort at a time, and surrounding the expressions with
"set sort_mem=something_big" statements to give it enough
space to not hit the disk.

  SET SORT_MEM=300000;
  CREATE TEMPORARY TABLE potential_new_values AS
     SELECT DISTINCT val FROM import_table;
  ...
  SET SORT_MEM=1000;

Anyone else have similar experience, or am I doing something
wrong to need so much SORT_MEM?


    Ron



PS:

Below is an example of another real-world query from the same
reporting system that benefits from a sort_mem over 32M.
Explain analyze (below) shows a 40% improvement by having
the sort fit in memory.

10Meg and 32Meg take over 22 seconds.  100Meg takes 14.

====================================================================================================
logs2=#
logs2=#
logs2=# set sort_mem=10000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE:  QUERY PLAN:

Unique  (cost=71612.82..72838.69 rows=49035 width=17) (actual time=20315.47..22457.21 rows=2914 loops=1)
  ->  Sort  (cost=71612.82..71612.82 rows=490348 width=17) (actual time=20315.46..21351.42 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.08..2932.72 rows=511368 loops=1)
Total runtime: 22475.63 msec

EXPLAIN
logs2=# set sort_mem=32000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE:  QUERY PLAN:

Unique  (cost=60442.82..61668.69 rows=49035 width=17) (actual time=22657.31..24794.19 rows=2914 loops=1)
  ->  Sort  (cost=60442.82..60442.82 rows=490348 width=17) (actual time=22657.30..23714.43 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.07..3020.83 rows=511368 loops=1)
Total runtime: 24811.65 msec

EXPLAIN
logs2=# set sort_mem=100000;
SET VARIABLE
logs2=# explain analyze select distinct category from c_transaction_credit;
NOTICE:  QUERY PLAN:

Unique  (cost=60442.82..61668.69 rows=49035 width=17) (actual time=12205.19..14012.57 rows=2914 loops=1)
  ->  Sort  (cost=60442.82..60442.82 rows=490348 width=17) (actual time=12205.18..12710.16 rows=511368 loops=1)
        ->  Seq Scan on c_transaction_credit  (cost=0.00..14096.48 rows=490348 width=17) (actual time=0.08..3001.05 rows=511368 loops=1)
Total runtime: 14187.96 msec

EXPLAIN
logs2=#


Re: Caching (was Re: choosing the right platform)

From: "scott.marlowe"

On Thu, 10 Apr 2003, Ron Mayer wrote:

>
> Short summary...
>
>   I think sort_mem matters quite a bit (20-40%) on
>   my data-warehousing applications.
>
>   Am I doing something wrong to need so much sort_mem?

No.  In fact, it's not uncommon for certain queries to need WAY more sort
memory than most queries.  The mistake that gets made is setting sort_mem
to something like 32 meg for every sort.  There are many "sorts" on my
machine that are coming from well-ordered data and don't really need to
be done in memory to be reasonably fast.  Those can run fine with 8 meg of
sort_mem.  For data that is less well ordered in the database, or where the
data set is really big (100s of megs of data being sorted), it often helps
to just grab a 100 meg sort_mem for the session.
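
Something like this, per session (7.3 syntax; the table name is made up):

  SET sort_mem = 100000;   -- roughly 100MB of sort space, this session only
  SELECT DISTINCT category FROM big_import_table;
  RESET sort_mem;          -- back to the server default afterwards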

If sort_mem is too big, the OS will likely wind up swapping it (or shared
memory) out and thrashing at worst, or at best surrendering all spare
memory to sort_mem and thus flushing the entire fs cache.  For a lot of
apps, it's all about finding the sweet spot of memory for each subsystem,
and sort_mem can go from nibbling memory to eating it like Nibbler from
Futurama in seconds if you set it just a little too high and have the
right parallel load on your server.

So, as long as you aren't starving your server of resources, setting
sort_mem higher is fine.


Re: Caching (was Re: choosing the right platform)

From: Josh Berkus

Ron,

> In a data warehouse we have here, we load about 3 million rows
> each week; load time improved from about 9 to 7 hours
> by breaking up such queries into expressions that only require
> one sort at a time, and surrounding the expressions with
> "set sort_mem=something_big" statements to give it enough
> space to not hit the disk.
>
>   SET SORT_MEM=300000;
>   CREATE TEMPORARY TABLE potential_new_values AS
>      SELECT DISTINCT val FROM import_table;
>   ...
>   SET SORT_MEM=1000;
>
> Anyone else have similar experience, or am I doing something
> wrong to need so much SORT_MEM?

No, this sounds very reasonable to me.  I do a similar operation on one of my
systems as part of a nightly data transformation for reporting.  Since I
haven't had to do those on tables over 150,000 rows, I haven't seen the kind
of RAM usage you experience.

> Below is an example of another real-world query from the same
> reporting system that benefits from a sort_mem over 32M.
> Explain analyze (below) shows a 40% improvement by having
> the sort fit in memory.

Cool!  That's a perfect example of sizing sort_mem for the query.   Mind if I
steal it for an article at some point?

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Caching (was Re: choosing the right platform)

From: "Ron Mayer"

Josh wrote:
>Ron,
>> Below is an example of another real-world query from the same
>> reporting system that benefits from a sort_mem over 32M.
>> Explain analyze (below) shows a 40% improvement by having
>> the sort fit in memory.
>
>Cool!  That's a perfect example of sizing sort_mem for the query.   Mind if I
>steal it for an article at some point?

Gladly!


BTW... if you're writing a tuning article, the most interesting one
I've seen is:
  http://otn.oracle.com/oramag/webcolumns/2002/techarticles/scalzo_linux01.html
I like how they broke the process down into many steps and measured after
each one.  I was intrigued that Linux's VM tweaking (vm.bdflush) affected
performance as much as the more commonly tweaked "noatime".
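
For anyone who wants to poke at the same two knobs, a rough sketch on a
2.4-era Linux kernel (the mount point is illustrative):

   cat /proc/sys/vm/bdflush                 # inspect current bdflush settings
   mount -o remount,noatime /var/lib/pgsql  # stop atime updates on PG's data partition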

   Ron