Thread: tuning questions

tuning questions

From
Jack Coates
Date:
Hi,

Sorry for the duplication; I asked this on pgsql-admin first before
realizing it wasn't the appropriate list.

I'm having trouble optimizing PostgreSQL for an admittedly heinous
worst-case scenario load.

testbed:
dual P3 1.3 GHz box with 2GB RAM
two IDE 120G drives on separate channels (DMA on), OS on one, DB on the
other, some swap on each (totalling 2.8G).
RH Linux 8.

I've installed PG 7.3.4 from source (./configure && make && make
install) and from PGDG RPMs and can switch back and forth. I also have
the 7.4 source but haven't done any testing with it yet aside from
starting it and importing some data.

The application is on another server, and does this torture test: it
builds a large table (~6 million rows in one test, ~18 million in
another). Rows are then pulled in chunks of 4 to 6 thousand, acted on,
and inserted back into another table (which will of course eventually
grow to the full size of the first).

The problem is that pulling the 4 to 6 thousand rows puts PostgreSQL
into a tailspin: postmaster hammers the CPU anywhere from 90 seconds to
five minutes before returning the data. During this time vmstat shows
that disk activity is up, of course, but it doesn't appear to be due to
page swapping (according to free, top, and vmstat).

Another problem is that performance of the 6 million row job is decent
if I stop the job and run a vacuumdb --analyze before letting it
continue; is this something that 7.4 will help with? vacuumdb --analyze
doesn't seem to have much effect on the 18 million row job.

I've tweaked shared buffers to 8192, pushed sort memory to 2048, vacuum
memory to 8192, and effective cache size to 10000.
/proc/sys/kernel/shmmax is set to 1600000000 and /proc/sys/fs/file-max
is set to 65536. Ulimit -n 3192.

I've read several sites and postings on tuning PG and have tried a
number of different theories, but I'm still not getting the architecture
of how things work.

thanks,
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Jeff
Date:
On Thu, 04 Dec 2003 08:06:23 -0800
Jack Coates <jack@lyris.com> wrote:

> testbed:
> dual P3 1.3 GHz box with 2GB RAM
> two IDE 120G drives on separate channels (DMA on), OS on one, DB on
> the other, some swap on each (totalling 2.8G).
> RH Linux 8.

Side Note: be sure to turn off write caching on those disks or you may
have data corruption in the event of a failure.
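
On Linux you can usually do that with hdparm; e.g. (device names are
just examples, use whichever drives the DB actually lives on):

    hdparm -W0 /dev/hda    # -W0 disables the drive's write cache
    hdparm -W0 /dev/hdc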

> The problem is that pulling the 4 to 6 thousand rows puts PostgreSQL
> into a tail spin: postmaster hammers on CPU anywhere from 90 seconds
> to five minutes before returning the data. During this time vmstat
> shows that disk activity is up of course, but it doesn't appear to be
> with page swapping (free and top and vmstat).
>
Have you tried modifying the app to retrieve the rows in smaller chunks
(using a cursor)? That way it only needs to allocate memory to hold, say,
100 rows at a time instead of 6000.
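
Something along these lines (table and column names are made up):

    BEGIN;
    DECLARE big_cur CURSOR FOR
        SELECT id, payload FROM source_table;
    FETCH 100 FROM big_cur;    -- repeat until it returns no rows
    CLOSE big_cur;
    COMMIT;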

Also, have you explain analyze'd your queries to make sure PG is picking
a good plan to execute?

> I've tweaked shared buffers to 8192, pushed sort memory to 2048,
> vacuum memory to 8192, and effective cache size to 10000.
> /proc/sys/kernel/shmmax is set to 1600000000 and /proc/sys/fs/file-max
> is set to 65536. Ulimit -n 3192.

You should set effective_cache_size bigger, especially with 2GB of
memory. effective_cache_size tells PG 'about' how much data it can
expect the OS to cache.
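
For example, if free/vmstat show the OS caching roughly 1.5GB, that
works out (in 8k pages) to:

    effective_cache_size = 196608    # 1.5GB / 8KB pages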

And I'm not sure about your query, but perhaps the sort of those 6000
rows is spilling to disk?  If you look in explain analyze you'll see in
the "Sort" step(s) it will tell you how many rows and how "wide" they
are.  If rows * width > sort_mem, it will have to spill the sort to
disk, which is slow.
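
i.e. something like (the query itself is just an example):

    EXPLAIN ANALYZE
    SELECT * FROM source_table ORDER BY some_col;

then multiply rows by width from the Sort step to estimate bytes, and
compare that against sort_mem * 1024 (sort_mem is in kilobytes).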

If you post query info and explain analyze's we can help optimize the
query itself.


--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/

Re: tuning questions

From
Josh Berkus
Date:
Jack,

> The application is on another server, and does this torture test: it
> builds a large table (~6 million rows in one test, ~18 million in
> another). Rows are then pulled in chunks of 4 to 6 thousand, acted on,
> and inserted back into another table (which will of course eventually
> grow to the full size of the first).

> I've tweaked shared buffers to 8192, pushed sort memory to 2048, vacuum
> memory to 8192, and effective cache size to 10000.
> /proc/sys/kernel/shmmax is set to 1600000000 and /proc/sys/fs/file-max
> is set to 65536. Ulimit -n 3192.

Have you read this?
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

Actually, your situation is not "worst case".  For one thing, your process is
effectively single-user; this allows you to throw all of your resources at
one user.    The problem is that your settings have effectively throttled PG
at a level appropriate to a many-user and/or multi-purpose system.  You need
to "open them up".

For something involving massive updating/transformation like this, once you've
done the basics (see that URL above) the main settings which will affect you
are sort_mem and checkpoint_segments, both of which I'd advise jacking way up
(test by increments).  Raising wal_buffers wouldn't hurt either.
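
E.g., as a starting point to test from (these numbers are only
illustrative, not a recommendation for your particular box):

    sort_mem = 32768            # KB, and note it's per sort
    checkpoint_segments = 16    # checkpoint less often during bulk loads
    wal_buffers = 64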

Also, give some thought to running VACUUM and/or ANALYZE between segments of
your procedure.    Particularly if you do updates to many rows of a table and
then query based on the changed data, it is vital to run an ANALYZE first,
and usually a good idea to run a VACUUM if it was an UPDATE or DELETE and not
an INSERT.
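
In the procedure itself that would look something like (table name is
hypothetical):

    -- after a big UPDATE or DELETE step:
    VACUUM ANALYZE work_table;
    -- after a bulk INSERT, ANALYZE alone is enough:
    ANALYZE work_table;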

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: tuning questions

From
Rob Fielding
Date:
>
> I've tweaked shared buffers to 8192, pushed sort memory to 2048, vacuum
> memory to 8192, and effective cache size to 10000.
> /proc/sys/kernel/shmmax is set to 1600000000 and /proc/sys/fs/file-max
> is set to 65536. Ulimit -n 3192.

Your shared memory is too high, and not even being used effectively. Your
other settings are too low.

Ball park guessing here, but I'd say first read (and understand) this:

http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

Then make shared memory about 10-20% of available RAM, and set:

((shmmax/1024) - ( 14.2 * max_connections ) - 250 ) / 8.2 = shared_buffers

decrease random_page_cost to 0.3 and whack up sort mem by 16 times,
effective cache size to about 50% RAM (depending on your other settings)
and try that for starters.


--

Rob Fielding
rob@dsvr.net

www.dsvr.co.uk              Development             Designer Servers Ltd


Re: tuning questions

From
"scott.marlowe"
Date:
On Thu, 4 Dec 2003, Jack Coates wrote:

> Another problem is that performance of the 6 million row job is decent
> if I stop the job and run a vacuumdb --analyze before letting it
> continue; is this something that 7.4 will help with? vacuumdb --analyze
> doesn't seem to have much effect on the 18 million row job.

Just to add to what the others have said here, you probably want to run
the pg_autovacuum daemon in the background.  It comes with 7.4 but will
work fine with 7.3.



Re: tuning questions

From
Dror Matalon
Date:
On Thu, Dec 04, 2003 at 11:59:32AM -0500, Jeff wrote:
> On Thu, 04 Dec 2003 08:06:23 -0800
> Jack Coates <jack@lyris.com> wrote:
>
> > testbed:
> > dual P3 1.3 GHz box with 2GB RAM
> > two IDE 120G drives on separate channels (DMA on), OS on one, DB on
> > the other, some swap on each (totalling 2.8G).
> > RH Linux 8.
>
> Side Note: be sure to turn off write caching on those disks or you may
> have data corruption in the event of a failure

I've seen this comment several times from different people.
Would someone care to explain how you would get data corruption? I
thought that the whole idea of the log is to provide a journal similar
to what you get in a journaling file system.

In other words, the db writes a series of transactions to the log and marks
that "log entry" (don't know the right nomenclature) as valid. When the db
crashes, it reads the log, discards the last "log entry" if it wasn't
marked as valid, and "replays" any transactions that haven't been
committed to the db. The end result being that you might lose your last
transaction(s) if the db crashes, but nothing ever gets corrupted.

So what am I missing in this picture?

Regards,

Dror

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

Re: tuning questions

From
Jord Tanner
Date:
If I understand the problem correctly, the issue is that IDE drives
signal that data has been written to disk while they are actually holding
the data in the write cache. In the case of a power down (and I remember
someone showing some test results confirming this, check the list
archive) the data in the drive write cache is lost, resulting in
corrupted logs.

Anyone else have more details?

Jord Tanner

On Thu, 2003-12-04 at 09:57, Dror Matalon wrote:
> On Thu, Dec 04, 2003 at 11:59:32AM -0500, Jeff wrote:
> > On Thu, 04 Dec 2003 08:06:23 -0800
> > Jack Coates <jack@lyris.com> wrote:
> >
> > > testbed:
> > > dual P3 1.3 GHz box with 2GB RAM
> > > two IDE 120G drives on separate channels (DMA on), OS on one, DB on
> > > the other, some swap on each (totalling 2.8G).
> > > RH Linux 8.
> >
> > Side Note: be sure to turn off write caching on those disks or you may
> > have data corruption in the event of a failure
>
> I've seen this comment several times from different people.
> Would someone care to explain how you would get data corruption? I
> thought that the whole idea of the log is to provide a journal similar
> to what you get in a journaling file system.
>
> In other words, the db writes a series of transactions to the log and marks
> that "log entry" (don't know the right nomenclature) as valid. When the db
> crashes, it reads the log, discards the last "log entry" if it wasn't
> marked as valid, and "replays" any transactions that haven't been
> committed to the db. The end result being that you might lose your last
> transaction(s) if the db crashes, but nothing ever gets corrupted.
>
> So what am I missing in this picture?
>
> Regards,
>
> Dror
--
Jord Tanner <jord@indygecko.com>


Re: tuning questions

From
Andrew Sullivan
Date:
On Thu, Dec 04, 2003 at 09:57:38AM -0800, Dror Matalon wrote:
>
> I've seen this comment several times from different people.
> Would someone care to explain how you would get data corruption? I
> thought that the whole idea of the log is to provide a journal similar
> to what you get in a journaling file system.

> So what am I missing in this picture?

That a journalling file system can _also_ have file corruption if you
have write caching enabled and no battery back up.  If the drive
tells the OS, "Yep!  It's all on the disk!" but it is _not_ actually
scribed in the little bitty magnetic patterns -- and at that very
moment, the power goes away -- the data that was reported to have been
on the disk, but which was actually _not_ on the disk, is no longer
anywhere.  (Well, except in the past.  But time travel was disabled
some versions ago. ;-)

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: tuning questions

From
Josh Berkus
Date:
Scott,

> Just to add to what the others have said here, you probably want to run
> the pg_autovacuum daemon in the background.  It comes with 7.4 but will
> work fine with 7.3.

I don't recommend using pg_autovacuum with a data transformation task.   pg_av
is designed for "regular use" not huge batch tasks.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: tuning questions

From
"scott.marlowe"
Date:
On Thu, 4 Dec 2003, Josh Berkus wrote:

> Scott,
>
> > Just to add to what the others have said here, you probably want to run
> > the pg_autovacuum daemon in the background.  It comes with 7.4 but will
> > work fine with 7.3.
>
> I don't recommend using pg_autovacuum with a data transformation task.   pg_av
> is designed for "regular use" not huge batch tasks.

What bad thing is likely to happen if it's used here?  Fire too often or
use too much I/O bandwidth?  Would that be fixed by the patch being tested
to introduce a delay every x pages of vacuuming?


Re: tuning questions

From
Jack Coates
Date:
On Thu, 2003-12-04 at 09:12, Rob Fielding wrote:
> >
> > I've tweaked shared buffers to 8192, pushed sort memory to 2048, vacuum
> > memory to 8192, and effective cache size to 10000.
> > /proc/sys/kernel/shmmax is set to 1600000000 and /proc/sys/fs/file-max
> > is set to 65536. Ulimit -n 3192.
>
> Your shared memory is too high, and not even being used effectively. Your
> other settings are too low.
>
> Ball park guessing here, but I'd say first read (and understand) this:
>
> http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

I've read it many times, understanding is slower :-)

>
> Then make shared memory about 10-20% of available RAM, and set:
>
> ((shmmax/1024) - ( 14.2 * max_connections ) - 250 ) / 8.2 = shared_buffers
>
> decrease random_page_cost to 0.3 and whack up sort mem by 16 times,
> effective cache size to about 50% RAM (depending on your other settings)
> and try that for starters.

Following this, I've done:

    2GB RAM                  =  2,000,000,000 bytes
    15% of that              =    300,000,000 bytes
    divided by 1024          =        292,969 kbytes
    max_conn * 14.2          =            454 kbytes
    subtract that            =        292,514 kbytes
    subtract 250             =        292,264 kbytes
    divide by 8.2            =         35,642 shared_buffers

performance is unchanged for the 18M job -- pg continues to use ~
285-300M, system load and memory usage stay the same. I killed that,
deleted from the affected tables, inserted a 6M job, and started a
vacuumdb --analyze. It's been running for 20 minutes now...

getting the SQL query better optimized for PG is on my todo list, but
not something I can do right now -- this application is designed to be
cross-platform with MS-SQL, PG, and Oracle so tweaking SQL is a touchy
subject.

The pgavd conversation is intriguing, but I don't really understand the
role of vacuuming. Would this be a correct statement: "PG needs to
regularly re-evaluate the database in order to adjust itself?" I'm
imagining that it continues to treat the table as a small one until
vacuum informs it that the table is now large?
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Eric Soroos
Date:
On Dec 4, 2003, at 10:11 AM, Andrew Sullivan wrote:

> On Thu, Dec 04, 2003 at 09:57:38AM -0800, Dror Matalon wrote:
>>
>> I've seen this comment several times from different people.
>> Would someone care to explain how you would get data corruption? I
>> thought that the whole idea of the log is to provide a journal similar
>> to what you get in a journaling file system.
>
>> So what am I missing in this picture?
>
> That a journalling file system can _also_ have file corruption if you
> have write caching enabled and no battery back up.  If the drive
> tells the OS, "Yep!  It's all on the disk!" but it is _not_ actually
> scribed in the little bitty magnetic patterns -- and at that very
> moment, the power goes away -- the data that was reported to have been
> on the disk, but which was actually _not_ on the disk, is no longer
> anywhere.  (Well, except in the past.  But time travel was disabled
> some versions ago. ;-)

It's not just a theoretical problem.  It's happened to me on a laptop
drive in the last week or so.

I was testing out dbmail by hammering on it on a Panther laptop, hfs+
journaling enabled, psql 7.4, latest and greatest.  I managed to hang
the system hard, requiring a reboot. Psql wouldn't start after the
crash, complaining of a damaged relation and helpfully telling me that
'you may need to restore from backup'.

No big deal on the data loss, since it was a test/hammering
installation. It would have been nice to be able to drop that relation
or prune the entire database, but I'm sure that would ultimately run
into referential integrity problems.

eric



Re: tuning questions

From
Josh Berkus
Date:
Jack,

> Following this, I've done:
> 2gb ram
> =
>  2,000,000,000
> bytes

This calculation is fun, but I really don't know where you got it from.   It
seems quite baroque.  What are you trying to set, exactly?

> getting the SQL query better optimized for PG is on my todo list, but
> not something I can do right now -- this application is designed to be
> cross-platform with MS-SQL, PG, and Oracle so tweaking SQL is a touchy
> subject.

Well, if your queries are screwed up, no amount of .conf optimization is
going to help you much.     You could criticize that PG is less adept than
some other systems at re-writing "bad queries", and you would be correct.
However, there's not much to do about that on existing systems.

How about posting some sample code?

> The pgavd conversation is intriguing, but I don't really understand the
> role of vacuuming. Would this be a correct statement: "PG needs to
> regularly re-evaluate the database in order to adjust itself?" I'm
> imagining that it continues to treat the table as a small one until
> vacuum informs it that the table is now large?

Not Vacuum, Analyze.  Otherwise correct.  Mind you, in "regular use" where
only a small % of the table changes per hour, periodic ANALYZE is fine.
However, in "batch data transform" analyze statements need to be keyed to the
updates and/or imports.

BTW, I sent a couple of e-mails to the Lyris documentation maintainer about
updating out-of-date information about setting up PostgreSQL.   I never got a
response, and I don't think my changes were made.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: tuning questions

From
Jack Coates
Date:
On Thu, 2003-12-04 at 11:20, Josh Berkus wrote:
> Jack,
>
> > Following this, I've done:
> > 2gb ram
> > =
> >  2,000,000,000
> > bytes
>
> This calculation is fun, but I really don't know where you got it from.   It
> seems quite baroque.  What are you trying to set, exactly?
Message-ID:  <3FCF6AEB.908@dsvr.net>
Date: Thu, 04 Dec 2003 17:12:11 +0000
From: Rob Fielding <rob@dsvr.net>

I'm trying to set Postgres's shared memory usage in a fashion that
allows it to return requested results quickly. Unfortunately, none of
these changes allow PG to use more than a little under 300M RAM.
vacuumdb --analyze is now taking an inordinate amount of time as well
(40 minutes and counting), so that change needs to be rolled back.

>
> > getting the SQL query better optimized for PG is on my todo list, but
> > not something I can do right now -- this application is designed to be
> > cross-platform with MS-SQL, PG, and Oracle so tweaking SQL is a touchy
> > subject.
>
> Well, if your queries are screwed up, no amount of .conf optimization is
> going to help you much.     You could criticize that PG is less adept than
> some other systems at re-writing "bad queries", and you would be correct.
> However, there's not much to do about that on existing systems.
>
> How about posting some sample code?

Tracking that down in CVS and translating from C++ is going to take a
while -- is there a way to get PG to log the queries it's receiving?

>
> > The pgavd conversation is intriguing, but I don't really understand the
> > role of vacuuming. Would this be a correct statement: "PG needs to
> > regularly re-evaluate the database in order to adjust itself?" I'm
> > imagining that it continues to treat the table as a small one until
> > vacuum informs it that the table is now large?
>
> Not Vacuum, Analyze.  Otherwise correct.  Mind you, in "regular use" where
> only a small % of the table changes per hour, periodic ANALYZE is fine.
> However, in "batch data transform" analyze statements need to be keyed to the
> updates and/or imports.
>
> BTW, I sent a couple of e-mails to the Lyris documentation maintainer about
> updating out-of-date information about setting up PostgreSQL.   I never got a
> response, and I don't think my changes were made.

She sits on the other side of the cube wall from me, and if I find a
decent config it's going into the manual -- consider this a golden
opportunity :-)

--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Jack Coates
Date:
On Thu, 2003-12-04 at 12:27, Richard Huxton wrote:
> On Thursday 04 December 2003 19:50, Jack Coates wrote:
> >
> > I'm trying to set Postgres's shared memory usage in a fashion that
> > allows it to return requested results quickly. Unfortunately, none of
> > these changes allow PG to use more than a little under 300M RAM.
> > vacuumdb --analyze is now taking an inordinate amount of time as well
> > (40 minutes and counting), so that change needs to be rolled back.
>
> You don't want PG to use all your RAM, it's designed to let the underlying OS
> do a lot of caching for it. Probably worth having a look at vmstat/iostat and
> see if it's saturating on I/O.

latest changes:
shared_buffers = 35642
max_fsm_relations = 1000
max_fsm_pages = 10000
wal_buffers = 64
sort_mem = 32768
vacuum_mem = 32768
effective_cache_size = 10000

/proc/sys/kernel/shmmax = 500000000

IO is active, but hardly saturated. CPU load is hefty though, load
average is at 4 now.

   procs                      memory    swap          io
system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
sy  id
 0  2  1   2808  11436  39616 1902988   0   0   240   896  765   469
2  11  87
 0  2  1   2808  11432  39616 1902988   0   0   244   848  768   540
4   3  93
 0  2  1   2808  11432  39616 1902984   0   0   204   876  788   507
3   4  93
 0  2  1   2808  11432  39616 1902984   0   0   360   416  715   495
4   1  96
 0  2  1   2808  11432  39616 1902984   0   0   376   328  689   441
2   1  97
 0  2  0   2808  11428  39616 1902976   0   0   464   360  705   479
2   1  97
 0  2  1   2808  11428  39616 1902976   0   0   432   380  718   547
3   1  97
 0  2  1   2808  11428  39616 1902972   0   0   440   372  742   512
1   3  96
 0  2  1   2808  11428  39616 1902972   0   0   416   364  711   504
3   1  96
 0  2  1   2808  11424  39616 1902972   0   0   456   492  743   592
2   1  97
 0  2  1   2808  11424  39616 1902972   0   0   440   352  707   494
2   1  97
 0  2  1   2808  11424  39616 1902972   0   0   456   360  709   494
2   2  97
 0  2  1   2808  11436  39616 1902968   0   0   536   516  807   708
3   2  94

--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Richard Huxton
Date:
On Thursday 04 December 2003 19:50, Jack Coates wrote:
>
> I'm trying to set Postgres's shared memory usage in a fashion that
> allows it to return requested results quickly. Unfortunately, none of
> these changes allow PG to use more than a little under 300M RAM.
> vacuumdb --analyze is now taking an inordinate amount of time as well
> (40 minutes and counting), so that change needs to be rolled back.

You don't want PG to use all your RAM, it's designed to let the underlying OS
do a lot of caching for it. Probably worth having a look at vmstat/iostat and
see if it's saturating on I/O.

--
  Richard Huxton
  Archonet Ltd

Re: tuning questions

From
Josh Berkus
Date:
Jack,

> latest changes:
> shared_buffers = 35642

This is fine, it's about 14% of available RAM.  Though the way you calculated
it still confuses me.   It's not complicated; it should be between 6% and 15%
of available RAM; since you're doing a data-transformation DB, yours should
be toward the high end.

> max_fsm_relations = 1000
> max_fsm_pages = 10000

You want to raise this a whole lot if your data transformations involve large
delete or update batches.    I'd suggest running "vacuum analyze verbose"
between steps to see how many dead pages you're accumulating.
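
Something like the following (numbers are only a starting guess; let
the VACUUM VERBOSE output guide the real values):

    max_fsm_relations = 1000
    max_fsm_pages = 500000    # should cover the dead pages reported
                              # between your batch steps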

> wal_buffers = 64
> sort_mem = 32768
> vacuum_mem = 32768
> effective_cache_size = 10000

This is way the heck too low.  It's supposed to be the size of all available
RAM; I'd set it to 2GB*65% as a start.

> IO is active, but hardly saturated. CPU load is hefty though, load
> average is at 4 now.

Unless you're doing huge statistical aggregates (like radar charts), or heavy
numerical calculations-by-query, high CPU and idle I/O usually indicates a
really bad query, like badly mismatched data types on a join or unconstrained
joins or  overblown formatting-by-query.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: tuning questions

From
"scott.marlowe"
Date:
On Thu, 4 Dec 2003, Jack Coates wrote:

> On Thu, 2003-12-04 at 12:27, Richard Huxton wrote:
> > On Thursday 04 December 2003 19:50, Jack Coates wrote:
> > >
> > > I'm trying to set Postgres's shared memory usage in a fashion that
> > > allows it to return requested results quickly. Unfortunately, none of
> > > these changes allow PG to use more than a little under 300M RAM.
> > > vacuumdb --analyze is now taking an inordinate amount of time as well
> > > (40 minutes and counting), so that change needs to be rolled back.
> >
> > You don't want PG to use all your RAM, it's designed to let the underlying OS
> > do a lot of caching for it. Probably worth having a look at vmstat/iostat and
> > see if it's saturating on I/O.
>
> latest changes:
> shared_buffers = 35642
> max_fsm_relations = 1000
> max_fsm_pages = 10000
> wal_buffers = 64
> sort_mem = 32768
> vacuum_mem = 32768
> effective_cache_size = 10000
>
> /proc/sys/kernel/shmmax = 500000000
>
> IO is active, but hardly saturated. CPU load is hefty though, load
> average is at 4 now.

Postgresql is busily managing a far too large shared buffer.  Let the
kernel do that.  Postgresql's shared buffers should be big enough to hold
as much of the current working set as it can, up to about 25% or so of the
server's memory, or 512Meg, whichever comes first.  Unless a single query
will actually use all of the buffer at once, you're not likely to see an
improvement.

Also, your effective cache size is really small.  On a typical Postgresql
server with 2 gigs of ram, you'll have about 1 to 1.5 gigs as kernel cache
and buffer, and if it's dedicated to postgresql, then the effective cache
setting for 1 gig would be 131072 (assuming 8k pages).

If you're updating a lot of tuples without vacuums, you'll likely want to
up your fsm settings.

Note you can change things like sort_mem, effective_cache_size and
random_page_cost on the fly (but not shared buffers or the fsm settings;
those are allocated at startup.)

So, if you're gonna have one huge honkin query that needs to sort a
hundred megs at a time, but you'd rather not up your sort memory that high
(sort mem is PER SORT, not per backend or per database, so it can get out
of hand quickly) then you can just

set sort_mem=128000;

before throwing out the big queries that need all the sort.


Re: tuning questions

From
Eric Soroos
Date:
>
> IO is active, but hardly saturated. CPU load is hefty though, load
> average is at 4 now.
>
>    procs                      memory    swap          io
> system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs
> us  sy  id

>  0  2  1   2808  11432  39616 1902984   0   0   204   876  788   507
> 3   4  93

You're getting a load average of 4 with 93% idle?

That's a reasonable number of context switches, and if the blocks
you're reading/writing are discontinuous, I could see io saturation
rearing its head.

This looks to me like you're starting and killing a lot of processes.

Is this thrashing psql connections, or is it one big query? What are
your active processes?

Your effective cache size looks to be about 1900 megs (+- binary),
assuming all of it is pg.

eric



Re: tuning questions

From
Jack Coates
Date:
On Thu, 2003-12-04 at 13:24, Josh Berkus wrote:
> Jack,
>
> > latest changes:
> > shared_buffers = 35642
>
> This is fine, it's about 14% of available RAM.  Though the way you calculated
> it still confuses me.   It's not complicated; it should be between 6% and 15%
> of available RAM; since you're doing a data-transformation DB, yours should
> be toward the high end.
>
> > max_fsm_relations = 1000
> > max_fsm_pages = 10000
>
> You want to raise this a whole lot if your data transformations involve large
> delete or update batches.    I'd suggest running "vacuum analyze verbose"
> between steps to see how many dead pages you're accumulating.

This looks really difficult to tune, and based on the load I'm giving
it, it looks really important. I've tried the verbose analyze and I've
looked at the rules of thumb; neither approach seems good for the
pattern of "hammer the system for a day or two, then leave it alone for
a week." I'm setting it to 500000 (half of the biggest table size
divided by a 6k page size), but I'll keep tweaking this.

>
> > wal_buffers = 64
> > sort_mem = 32768
> > vacuum_mem = 32768
> > effective_cache_size = 10000
>
> This is way the heck too low.  it's supposed to be the size of all available
> RAM; I'd set it to 2GB*65% as a start.

This makes a little bit of difference. I set it to 65% (15869 pages).
Now we have some real disk IO:
   procs                      memory    swap          io
system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
sy  id
 0  3  1   2804  10740  40808 1899856   0   0 26624     0  941  4144
13  24  63
 1  2  1   2804  10808  40808 1899848   0   0 21748    60 1143  3655
9  22  69

still high cpu (3-ish load) though, and there's no noticeable
improvement in query speed.

>
> > IO is active, but hardly saturated. CPU load is hefty though, load
> > average is at 4 now.
>
> Unless you're doing huge statistical aggregates (like radar charts), or heavy
> numerical calculations-by-query, high CPU and idle I/O usually indicates a
> really bad query, like badly mismatched data types on a join or unconstrained
> joins or  overblown formatting-by-query.

Ran that by the programmer responsible for this area and watched the
statements go by with tcpdump -X. Looks like really simple stuff to me:
select a handful of values, then insert into one table and delete from
another.
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Jack Coates
Date:
On Thu, 2003-12-04 at 14:59, Eric Soroos wrote:
> >
> > IO is active, but hardly saturated. CPU load is hefty though, load
> > average is at 4 now.
> >
> >    procs                      memory    swap          io
> > system         cpu
> >  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs
> > us  sy  id
>
> >  0  2  1   2808  11432  39616 1902984   0   0   204   876  788   507
> > 3   4  93
>
> You're getting a load average of 4 with 93% idle?
down a bit since my last set of tweaks, but yeah:
  3:18pm  up 2 days,  3:37,  3 users,  load average: 3.42, 3.31, 2.81
66 processes: 65 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:  2.0% user,  3.4% system,  0.0% nice, 93.4% idle
CPU1 states:  1.3% user,  2.3% system,  0.0% nice, 95.2% idle
Mem:  2064656K av, 2053896K used,   10760K free,       0K shrd,   40388K
buff
Swap: 2899716K av,    2800K used, 2896916K free                 1896232K
cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
23103 root      15   0  1072 1072   840 R     1.3  0.0   0:01 top
23046 postgres  15   0 33364  32M 32220 S     0.5  1.6   0:12 postmaster
>
> That's a reasonable number of context switches, and if the blocks
> you're reading/writing are discontinuous, I could see io saturation
> rearing its head.
>
> This looks to me like you're starting and killing a lot of processes.

isn't that by design though? I've been looking at other postgres servers
around the company and they seem to act pretty similar under load (none
is being pounded to this level, though).

>
> Is this thrashing psql connections, or is it one big query? What are
> your active processes?

[root@postgres root]# ps auxw | grep postgres
postgres 23042  0.0  0.4 308808 8628 pts/0   S    14:46   0:00
/usr/bin/postmaster -p 5432
postgres 23043  0.0  0.4 309788 8596 pts/0   S    14:46   0:00 postgres:
stats buffer process
postgres 23044  0.0  0.4 308828 8620 pts/0   S    14:46   0:00 postgres:
stats collector process
postgres 23046  0.6  1.4 309952 29872 pts/0  R    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 INSERT waiting
postgres 23047  1.4 14.7 310424 304240 pts/0 S    14:46   0:21 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23048  0.4 14.7 310044 304368 pts/0 S    14:46   0:07 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23049  0.0  0.5 309820 10352 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23050  0.0  0.6 310424 13352 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23051  0.0  0.6 309940 12992 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23052  0.0  0.5 309880 11916 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23053  0.0  0.6 309924 12872 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23054  0.0  0.6 310012 13460 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23055  0.0  0.5 309932 12284 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23056  2.0 14.7 309964 304072 pts/0 S    14:46   0:30 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23057  2.4 14.7 309916 304104 pts/0 S    14:46   0:37 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23058  0.0  0.6 310392 13168 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23059  0.5 14.7 310424 304072 pts/0 S    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23060  0.0  0.6 309896 13212 pts/0  S    14:46   0:00 postgres:
lmuser lmdb 10.0.0.2 idle
postgres 23061  0.5  1.4 309944 29832 pts/0  R    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 INSERT
postgres 23062  0.6  1.4 309936 29832 pts/0  S    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 INSERT waiting
postgres 23063  0.6  1.4 309944 30028 pts/0  S    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 INSERT waiting
postgres 23064  0.6  1.4 309944 29976 pts/0  S    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 INSERT waiting
postgres 23065  1.4 14.7 310412 304112 pts/0 S    14:46   0:21 postgres:
lmuser lmdb 216.91.56.200 idle
postgres 23066  0.5  1.4 309944 29496 pts/0  S    14:46   0:08 postgres:
lmuser lmdb 216.91.56.200 INSERT waiting
postgres 23067  0.5  1.4 310472 30040 pts/0  D    14:46   0:09 postgres:
lmuser lmdb 216.91.56.200 idle
postgres 23068  0.6  1.4 309936 30104 pts/0  R    14:46   0:09 postgres:
lmuser lmdb 216.91.56.200 INSERT waiting
postgres 23069  0.5  1.4 309936 29716 pts/0  S    14:46   0:09 postgres:
lmuser lmdb 216.91.56.200 INSERT waiting
postgres 23070  0.6  1.4 309944 29744 pts/0  S    14:46   0:09 postgres:
lmuser lmdb 10.0.0.2 INSERT waiting

Ten-ish stay idle all the time; the inserts go to update when the big
select is done and rows get moved from the active to the completed
table.

> Your effective cache size looks to be about 1900 megs (+- binary),
> assuming all of it is pg.
>
> eric
>
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Richard Huxton
Date:
On Thursday 04 December 2003 23:16, Jack Coates wrote:
>
> > > effective_cache_size = 10000
> >
> > This is way the heck too low.  it's supposed to be the size of all
> > available RAM; I'd set it to 2GB*65% as a start.
>
> This makes a little bit of difference. I set it to 65% (15869 pages).

That's still only about 127MB (15869 * 8KB).

> Now we have some real disk IO:
>    procs                      memory    swap          io
> system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
> sy  id
>  0  3  1   2804  10740  40808 1899856   0   0 26624     0  941  4144

According to this your cache is currently 1,899,856 KB which in 8KB blocks is
237,482 - be frugal and say effective_cache_size = 200000 (or even 150000 if
the trace above isn't typical).

--
  Richard Huxton
  Archonet Ltd

Re: tuning questions

From
Jack Coates
Date:
On Thu, 2003-12-04 at 15:47, Richard Huxton wrote:
> On Thursday 04 December 2003 23:16, Jack Coates wrote:
> >
> > > > effective_cache_size = 10000
> > >
> > > This is way the heck too low.  it's supposed to be the size of all
> > > available RAM; I'd set it to 2GB*65% as a start.
> >
> > This makes a little bit of difference. I set it to 65% (15869 pages).
>
> That's still only about 127MB (15869 * 8KB).

yeah, missed the final digit when I copied it into the postgresql.conf
:-( Just reloaded with 158691 pages.
>
> > Now we have some real disk IO:
> >    procs                      memory    swap          io
> > system         cpu
> >  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
> > sy  id
> >  0  3  1   2804  10740  40808 1899856   0   0 26624     0  941  4144
>
> According to this your cache is currently 1,899,856 KB which in 8KB blocks is
> 237,482 - be frugal and say effective_cache_size = 200000 (or even 150000 if
> the trace above isn't typical).

d'oh, just realized what you're telling me here. /me smacks forehead.
Let's try effective_cache of 183105... (75%). Starting both servers,
waiting for big fetch to start, and...

   procs                      memory    swap          io
system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
sy  id
 0  0  0   2800  11920  40532 1906516   0   0     0     0  521     8
0   0 100
 0  1  0   2800  11920  40532 1906440   0   0   356    52  611   113
1   3  97
 0  1  0   2800  11920  40532 1906424   0   0 20604     0  897   808
1  18  81
 0  1  0   2800  11920  40532 1906400   0   0 26112     0  927   820
1  13  87
 0  1  0   2800  11920  40532 1906384   0   0 26112     0  923   812
1  12  87
 0  1  0   2800  11920  40532 1906372   0   0 24592     0  921   805
1  13  87
 0  1  0   2800  11920  40532 1906368   0   0  3248    48  961  1209
0   4  96
 0  1  0   2800  11920  40532 1906368   0   0  2600     0  845  1631
0   2  98
 0  1  0   2800  11920  40532 1906364   0   0  2728     0  871  1714
0   2  98

better in vmstat... but the query doesn't work any better unfortunately.

The frustrating thing is, we also have a UP P3-500 with 512M RAM and two
IDE drives with the same PG install which is doing okay with this load
-- still half the speed of MS-SQL2K, but usable. I'm at a loss.
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Eric Soroos
Date:
>
> d'oh, just realized what you're telling me here. /me smacks forehead.
> Let's try effective_cache of 183105... (75%). Starting both servers,
> waiting for big fetch to start, and...
>
>    procs                      memory    swap          io
> system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
> sy  id
>  0  0  0   2800  11920  40532 1906516   0   0     0     0  521     8
> 0   0 100
>  0  1  0   2800  11920  40532 1906440   0   0   356    52  611   113
> 1   3  97
>  0  1  0   2800  11920  40532 1906424   0   0 20604     0  897   808
> 1  18  81
>  0  1  0   2800  11920  40532 1906400   0   0 26112     0  927   820
> 1  13  87
>  0  1  0   2800  11920  40532 1906384   0   0 26112     0  923   812
> 1  12  87
>  0  1  0   2800  11920  40532 1906372   0   0 24592     0  921   805
> 1  13  87
>  0  1  0   2800  11920  40532 1906368   0   0  3248    48  961  1209
> 0   4  96
>  0  1  0   2800  11920  40532 1906368   0   0  2600     0  845  1631
> 0   2  98
>  0  1  0   2800  11920  40532 1906364   0   0  2728     0  871  1714
> 0   2  98
>
> better in vmstat... but the query doesn't work any better
> unfortunately.

Your io now looks like you're getting a few seconds of continuous read,
and then you're getting into maxing out random reads. These look about
right for a single ide drive.

> The frustrating thing is, we also have a UP P3-500 with 512M RAM and
> two
> IDE drives with the same PG install which is doing okay with this load
> -- still half the speed of MS-SQL2K, but usable. I'm at a loss.

I wonder if you're doing table scans. From the earlier trace, it looked
like you have a few parallel select/process/insert processes going.

If that's the case, you might be getting a big sequential scan at
first, then at some point you have enough selects going that it starts
looking more like random access.

Can you run one of the selects from the psql console and see how fast
it runs?  Do your inserts have any foreign key relations?

One thing you might try is to shut down the postmaster and move the
pg_clog and pg_xlog directories to the other drive, and leave symlinks
pointing back. That should help your insert performance by putting the
wal on a separate drive from the table data. It will really help if you
wind up having uncached read and write access at the same time. You
also might gain by using software raid 0 (with large stripe size, 512k
or so) across both drives, but if you don't have the appropriate
partitions in there now it's going to be a bunch of work.
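
The symlink move looks something like this (paths assume a stock RPM
layout; substitute your real $PGDATA and second drive):

    pg_ctl stop
    mv /var/lib/pgsql/data/pg_xlog /disk2/pg_xlog
    ln -s /disk2/pg_xlog /var/lib/pgsql/data/pg_xlog
    # repeat for pg_clog if you want to move it as well
    pg_ctl start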

eric


Re: tuning questions

From
Thierry Missimilly
Date:

Jack Coates wrote:

>
> latest changes:
> shared_buffers = 35642
> max_fsm_relations = 1000
> max_fsm_pages = 10000
> wal_buffers = 64
> sort_mem = 32768
> vacuum_mem = 32768
> effective_cache_size = 10000
>
> /proc/sys/kernel/shmmax = 500000000
>
> IO is active, but hardly saturated. CPU load is hefty though, load
> average is at 4 now.
>
>    procs                      memory    swap          io
> system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us
> sy  id
>  0  2  1   2808  11436  39616 1902988   0   0   240   896  765   469
> 2  11  87
>  0  2  1   2808  11432  39616 1902988   0   0   244   848  768   540
> 4   3  93
>  0  2  1   2808  11432  39616 1902984   0   0   204   876  788   507
> 3   4  93
>  0  2  1   2808  11432  39616 1902984   0   0   360   416  715   495
> 4   1  96
>  0  2  1   2808  11432  39616 1902984   0   0   376   328  689   441
> 2   1  97
>  0  2  0   2808  11428  39616 1902976   0   0   464   360  705   479
> 2   1  97
>  0  2  1   2808  11428  39616 1902976   0   0   432   380  718   547
> 3   1  97
>  0  2  1   2808  11428  39616 1902972   0   0   440   372  742   512
> 1   3  96
>  0  2  1   2808  11428  39616 1902972   0   0   416   364  711   504
> 3   1  96
>  0  2  1   2808  11424  39616 1902972   0   0   456   492  743   592
> 2   1  97
>  0  2  1   2808  11424  39616 1902972   0   0   440   352  707   494
> 2   1  97
>  0  2  1   2808  11424  39616 1902972   0   0   456   360  709   494
> 2   2  97
>  0  2  1   2808  11436  39616 1902968   0   0   536   516  807   708
> 3   2  94
>

Hi Jack,

As shown by vmstat, your operating system is spending 96% of its time idle.
On RedHat 8.0 IA32, "idle" covers both true idle and wait-on-I/O.
In your case, I think it is mostly wait-on-I/O, as you are working on a 2.8 GB
DB with only 2 GB RAM; true idle should be around 30%.
Your performance will increase only if user CPU increases; if, for example,
your system swaps, only system CPU will increase and your application will
stay slow.

You can check your I/O in more detail with "iostat 3 1000", and verify that
the max tps are on the database filesystem.

So, all the Postgres tuning you have tried does not change much, as the
bottleneck is your I/O throughput.
But one thing you can check is which parts of Postgres need the most I/O.
To do that, after shutting down PG, move your database to another disk (the
OS disk?), for example /mypg/data, and create a symbolic link from
$PGDATA/base to /mypg/data/<mydb>.

Restart PG, and while you execute your application, check with iostat which
disk has the max tps. I bet it is the disk where the WAL buffers are logged.

One more thing about I/O: for an IDE disk, the maximum number of write blocks
plus read blocks per second is about 10000, assuming a 1 KB I/O block size.
That means 10 MB/s. If you need more, you can try striped SCSI disks or a
RAID0 disk subsystem.

Thierry Missimilly

>
> --
> Jack Coates, Lyris Technologies Applications Engineer
> 510-549-4350 x148, jack@lyris.com
> "Interoperability is the keyword, uniformity is a dead end."
>                                 --Olivier Fourdan
>


Re: tuning questions

From
Josh Berkus
Date:
Jack,

> The frustrating thing is, we also have a UP P3-500 with 512M RAM and two
> IDE drives with the same PG install which is doing okay with this load
> -- still half the speed of MS-SQL2K, but usable. I'm at a loss.

Overall, I'm really getting the feeling that this procedure was optimized for
Oracle and/or MSSQL and is hitting some things that aren't such a good idea
for PostgreSQL.   I highly suggest that you try using log_duration and
log_statement (and in 7.4 log_min_duration_statement) to try to locate which
particular statements are taking the longest.
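
In 7.3's postgresql.conf those are simple booleans:

    log_statement = true    # log every statement as it arrives
    log_duration = true     # log how long each one took

(7.4's log_min_duration_statement lets you log only statements slower
than some threshold, which is much less noisy.)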

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: tuning questions

From
Jack Coates
Date:
On Fri, 2003-12-05 at 09:26, Josh Berkus wrote:
> Jack,
>
> > The frustrating thing is, we also have a UP P3-500 with 512M RAM and two
> > IDE drives with the same PG install which is doing okay with this load
> > -- still half the speed of MS-SQL2K, but usable. I'm at a loss.
>
> Overall, I'm really getting the feeling that this procedure was optimized for
> Oracle and/or MSSQL and is hitting some things that aren't such a good idea
> for PostgreSQL.   I highly suggest that you try using log_duration and
> log_statement (and in 7.4 log_min_duration_statement) to try to locate which
> particular statements are taking the longest.

I'll definitely buy that as round two of optimization, but round one is
still "it's faster on the slower server."

hdparm -I is identical between the boxes, filesystem structure layout is
identical, disk organization isn't identical, but far worse: the UP low
ram box has PG on /dev/hdb, ew. Predictably, vmstat shows low numbers...
but steady numbers.

dev is the box which goes fast, and I was wrong, it's actually a 2GHz
P4. rufus is the box which goes slow. During the big fetch:
dev bi sits around 2000 blocks for twenty seconds while bo is around 50
blocks, then bo jumps to 800 or so while the data is returned, then
we're done.

rufus bi starts at 16000 blocks, then drops steadily while bo climbs.
After a minute or so, bi stabilizes at 4096 blocks, then bo bursts to
return the data. Then the next fetch starts, and it's bi of 500, bo of
300 for several minutes.

These observations certainly all point to Eric and Thierry's
recommendations to better organize the filesystem and get faster disks...
except that the dev box gets acceptable performance.

So, I've dug into postgresql.conf on dev and rufus, and here's what I
found:

RUFUS:

    how much RAM do you have?
    75% of that, converted to 8K pages, for effective_cache_size
    15% of that or 512M, whichever is larger, converted to 8K pages,
      for shared_buffers
    15% of that, converted to 8K pages, for vacuum_mem
    how many messages will you send between vacuums?
    divide that by 2 and divide by 6 for max_fsm_pages

DEV:

    how much RAM do you have?
    48% of that, converted to 8K pages, for effective_cache_size
    6.5% of that or 512M, whichever is larger, converted to 8K pages,
      for shared_buffers
    52% of that, converted to 8K pages, for vacuum_mem
    max_fsm_pages untouched on this box.

I adjusted rufus's configuration to match those percentages, but left
max_fsm_pages dialed up to 500000. Now Rufus's vmstat shows much better
behavior: bi 12000 blocks gradually sloping down to 3000 during the big
select, bo steady until it's ready to return. As more jobs come in, we
see overlap areas where bi is 600-ish and bo is 200-ish, but they only
last a few tens of seconds.

The big selects are still a lot slower than they are on the smaller
database and overall performance is still unacceptable. Next I dialed
max_fsm_pages back down to 10000 -- no change. Hm, maybe it's been too
long since the last vacuumdb --analyze, let's give it another.

hdparm -Tt shows that disk performance is crappo on rufus, half what it
is on dev -- and freaking dev is using 16 bit IO! This is a motherboard
IDE controller issue.

South Bridge:                       VIA vt8233
Revision:                           ISA 0x0 IDE 0x6

That's it, I'm throwing out this whole test series and starting over
with different hardware. Database server is now a dual 2GHz Xeon with
2GB RAM & 2940UW SCSI, OS and PG's logs on 36G drive, PG data on 9GB
drive. Data is importing now and I'll restart the tests tonight.
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Jack Coates
Date:
On Fri, 2003-12-05 at 17:22, Jack Coates wrote:
...
> That's it, I'm throwing out this whole test series and starting over
> with different hardware. Database server is now a dual 2GHz Xeon with
> 2GB RAM & 2940UW SCSI, OS and PG's logs on 36G drive, PG data on 9GB
> drive. Data is importing now and I'll restart the tests tonight.

Sorry to reply to myself, but thought I'd note that the performance is
practically unchanged by moving to better hardware and separating logs
and data onto different spindles. Although the disks are twice as fast
by hdparm -Tt, their behavior as shown by iostat and vmstat is little
different between dual and dev (single P4-2GHz/512MB/(2)IDE drives).
Dual is moderately faster than my first, IDE-based testbed (about 8%),
but still only 30% as fast as the low-powered dev.

I've been running vacuumdb --analyze and/or vacuumdb --full between
each config change, and I also let the job run all weekend. Saturday it
got --analyze every three hours or so, Sunday it got --analyze once in
the morning. None of these vacuumdb's are making any difference.

Theories at this point, in no particular order:

a) major differences between my 7.3.4 from source (compiled with no
options) and dev's 7.3.2-1PGDG RPMs. Looking at the spec file doesn't
reveal anything glaring to me, but is there something I'm missing?

b) major differences between my kernel 2.4.18-14smp (RH8) and dev's
kernel 2.4.18-3 (RH7.3).

c) phase of the moon.

While SQL optimization is likely to improve performance across the
board, it doesn't explain the differences between these two systems and
I'd like to avoid it as a theory until the fast box can perform as well
as the slow box.

Any ideas? Thanks in advance,
--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
Tom Lane
Date:
Jack Coates <jack@lyris.com> writes:
> Theories at this point, in no particular order:

> a) major differences between my 7.3.4 from source (compiled with no
> options) and dev's 7.3.2-1PGDG RPMs. Looking at the spec file doesn't
> reveal anything glaring to me, but is there something I'm missing?

There are quite a few performance-related patches between 7.3.2 and
7.3.4.  Most of them should be in 7.3.4's favor but there are some
places where we had to take a performance hit in order to have a
suitably low-risk fix for a bug.  You haven't told us enough about
the problem to know if any of those cases apply, though.  AFAIR
you have not actually showed either the slow query or EXPLAIN ANALYZE
results for it on the two boxes ...

            regards, tom lane

Re: tuning questions

From
Jack Coates
Date:
On Mon, 2003-12-08 at 11:19, Tom Lane wrote:
> Jack Coates <jack@lyris.com> writes:
> > Theories at this point, in no particular order:
>
> > a) major differences between my 7.3.4 from source (compiled with no
> > options) and dev's 7.3.2-1PGDG RPMs. Looking at the spec file doesn't
> > reveal anything glaring to me, but is there something I'm missing?
>
> There are quite a few performance-related patches between 7.3.2 and
> 7.3.4.  Most of them should be in 7.3.4's favor but there are some
> places where we had to take a performance hit in order to have a
> suitably low-risk fix for a bug.  You haven't told us enough about
> the problem to know if any of those cases apply, though.  AFAIR
> you have not actually showed either the slow query or EXPLAIN ANALYZE
> results for it on the two boxes ...
>
>             regards, tom lane

Right, because re-architecture of a cross-platform query makes sense if
performance is bad on all systems, but is questionable activity when
performance is fine on some systems and lousy on others. Hence my
statement that while SQL optimization is certainly something we want to
do for across-the-board performance increase, I wanted to focus on other
issues for troubleshooting this problem. I will be back to ask about
data access models later :-)

I ended up going back to a default postgresql.conf and reapplying the
various tunings one-by-one. Turns out that while setting fsync = false
had little effect on the slow IDE box, it had a drastic effect on this
faster SCSI box and performance is quite acceptable now (aside from the
expected falloff of about 30% after the first twenty minutes, which I
believe comes from growing and shrinking tables without vacuumdb
--analyzing).

--
Jack Coates, Lyris Technologies Applications Engineer
510-549-4350 x148, jack@lyris.com
"Interoperability is the keyword, uniformity is a dead end."
                --Olivier Fourdan



Re: tuning questions

From
"Matt Clark"
Date:
> I ended up going back to a default postgresql.conf and reapplying the
> various tunings one-by-one. Turns out that while setting fsync = false
> had little effect on the slow IDE box, it had a drastic effect on this
> faster SCSI box and performance is quite acceptable now (aside from the
> expected falloff of about 30% after the first twenty minutes, which I
> believe comes from growing and shrinking tables without vacuumdb
> --analyzing).

Hmm.  I wonder if that could be related to the issue where many IDE drives
have write-caching enabled.  With the write cache enabled, fsyncs are nearly
immediate, so setting fsync=false makes little difference...





Re: tuning questions

From
Josh Berkus
Date:
Jack,

> Right, because re-architecture of a cross-platform query makes sense if
> performance is bad on all systems, but is questionable activity when
> performance is fine on some systems and lousy on others. Hence my
> statement that while SQL optimization is certainly something we want to
> do for across-the-board performance increase, I wanted to focus on other
> issues for troubleshooting this problem. I will be back to ask about
> data access models later :-)

Yes, but an EXPLAIN ANALYZE will also help show issues like sorts running out
of memory, etc.   Really, we don't currently have enough information to do
more than speculate; it's like trying to repair a car engine wearing a
blindfold.

Particularly since it's possible that there are only 1 or 2 "bad queries"
which are messing everything else up.

For that matter, it would really help to know:
-- How many simultaneous connections are running update queries during this
process?
-- How about some sample VACUUM VERBOSE results for the intra-process vacuums?

> I ended up going back to a default postgresql.conf and reapplying the
> various tunings one-by-one. Turns out that while setting fsync = false
> had little effect on the slow IDE box, it had a drastic effect on this
> faster SCSI box and performance is quite acceptable now (aside from the
> expected falloff of about 30% after the first twenty minutes, which I
> believe comes from growing and shrinking tables without vacuumdb
> --analyzing).

Well, that brings 2 things immediately to mind:
1) That may improve performance, but it does mean that if your machine loses
power you *will* be restoring from backup.   It's risky to do.

2) Your IDE system has write-caching enabled.   Once again, this is a nice
performance boost, if you don't mind database corruption in a power-out.


--
Josh Berkus
Aglio Database Solutions
San Francisco