Thread: Tuning Help - What did I do wrong?

Tuning Help - What did I do wrong?

From
Josh Trutwin
Date:
We have a pretty busy linux server running postgres 8.1.4, waiting to
upgrade until 8.3 to avoid dump/restoring twice.

# cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  3704217600 3592069120 112148480        0 39460864 2316271616
Swap: 2516918272   270336 2516647936

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.00GHz
stepping        : 3
cpu MHz         : 2992.795

The postgresql.conf was basically the default so I decided to
increase the cache size and a couple paramaters to make more use of
that memory - here's what I did:

shared_buffers = 16384 (was 1000)
work_mem = 16384 (was 1024)
wal_buffers = 24 (was 8)
checkpoint_segments = 5 (was 3)
effective_cache_size = 10000 (was 1000)
stats_command_string = on (was off)
stats_block_level = on (was off)
stats_row_level = on (was off)

In order to do this I had to change /proc/sys/kernel/shmmax to
536870912 (don't have /etc/sysctl)

Also, the entire cluster gets vacuumed analyzed nightly.

After making these changes, the performance on the server actually
worsened.   I slowly backed off on some of the paramaters but didn't
seem to help.

Wondering if those changes are silly?  For a server this size I
didn't think this would be problematic.

Thank you,

Josh


Re: Tuning Help - What did I do wrong?

From
"Scott Marlowe"
Date:
On 10/4/07, Josh Trutwin <josh@trutwins.homeip.net> wrote:
> We have a pretty busy linux server running postgres 8.1.4, waiting to
> upgrade until 8.3 to avoid dump/restoring twice.

You should immediate update your version to 8.1.whateverislatest.
That requires no dump / restore and it is a bug fix update.  I doubt
this problem is because you're out of date on patches, but who
knows...

> # cat /proc/meminfo
>         total:    used:    free:  shared: buffers:  cached:
> Mem:  3704217600 3592069120 112148480        0 39460864 2316271616
> Swap: 2516918272   270336 2516647936

Well, you've got plenty of memory, and a large chunk is being used as cache.

> The postgresql.conf was basically the default so I decided to
> increase the cache size and a couple paramaters to make more use of
> that memory - here's what I did:
>
> shared_buffers = 16384 (was 1000)
> work_mem = 16384 (was 1024)
> wal_buffers = 24 (was 8)
> checkpoint_segments = 5 (was 3)
> effective_cache_size = 10000 (was 1000)
> stats_command_string = on (was off)
> stats_block_level = on (was off)
> stats_row_level = on (was off)

Your changes seem reasonable.

> Also, the entire cluster gets vacuumed analyzed nightly.

You should look into running the autovacuum daemon.  for heavily used
databases nightly vacuuming may not be enough.

> After making these changes, the performance on the server actually
> worsened.   I slowly backed off on some of the paramaters but didn't
> seem to help.

Most likely turning on stats collection slowed you down a bit.

We need to see examples of what's slow, including explain analyze
output for slow queries.  Also a brief explanation of the type of load
your database server is seeing.  I.e. is it a lot of little
transactions, mostly read, batch processing, lots of users, one user,
etc...   Right now we don't have enough info to really help you.

Re: Tuning Help - What did I do wrong?

From
"Scott Marlowe"
Date:
Oh, and in addition to  my previous message, you should use tools like
vmstat, iostat and top to get an idea of what your server is doing.

What kind of drive subsystem do you have?  What kind of raid controller?  etc...

Re: Tuning Help - What did I do wrong?

From
Josh Trutwin
Date:
On Thu, 4 Oct 2007 11:19:22 -0500
"Scott Marlowe" <scott.marlowe@gmail.com> wrote:

> We need to see examples of what's slow, including explain analyze
> output for slow queries.  Also a brief explanation of the type of
> load your database server is seeing.  I.e. is it a lot of little
> transactions, mostly read, batch processing, lots of users, one
> user, etc...   Right now we don't have enough info to really help
> you.

Sorry, this server is for a few (100+?) websites so it's running
along site apache, php.  All connections to postgresql (except for
the occaional psql console login) are done from php requests, using
the same user (basically there are two users, the one php uses and
postgres).  The bulk of the activity would be reads, but
certainly inesrts/updates/deletes would be interspersed in there.
Most of the activity is done via auto-commits, not many long
transactions.

From your followup email:

> ... you should use tools like vmstat, iostat and top to get an idea
> of what your server is doing.

# vmstat
   procs                      memory    swap          io
system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs
us  sy  id
 3  1  0    268  68332  39016 2201436   0   0     3     3    4
2   3   4   2

sorry about the wrapping...

iostat is not found - will see if I can download it.  top typically
shows postmaster as the top process with 10-15% of the CPU, followed
by apache threads.

 12:01pm  up 104 days, 12:05,  2 users,  load average: 9.75, 9.30,
7.70
215 processes: 214 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  0.1% user,  0.0% system,  0.0% nice,  0.4% idle
Mem:  3617400K av, 3552784K used,  64616K free,      0K shrd,  37456K
buff
Swap: 2457928K av,    264K used, 2457664K free
2273664K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME
COMMAND
31797 postgres  17   0 28836  28M  1784 S       0  8.5  0.7  10:15
postmaster

> What kind of drive subsystem do you have?  What kind of raid
> controller?  etc...

Gathering more information on this - Raid is a software
RAID-1.  Some information:
I believe itI believe it
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               66G   50G   16G  76% /
/dev/sda1              15M  6.6M  8.5M  44% /boot

# cat /proc/mdstat
Personalities : [raid0] [raid1]
read_ahead 1024 sectors
md0 : active raid1 sdb3[0] sdc3[1]
      70573440 blocks [2/2] [UU]

unused devices: <none>

Thanks for your help, I'm more of a developer guy so let me know what
else is useful.

Josh

Re: Tuning Help - What did I do wrong?

From
"Scott Marlowe"
Date:
On 10/4/07, Josh Trutwin <josh@trutwins.homeip.net> wrote:
> On Thu, 4 Oct 2007 11:19:22 -0500
> "Scott Marlowe" <scott.marlowe@gmail.com> wrote:
>
> > We need to see examples of what's slow, including explain analyze
> > output for slow queries.  Also a brief explanation of the type of
> > load your database server is seeing.  I.e. is it a lot of little
> > transactions, mostly read, batch processing, lots of users, one
> > user, etc...   Right now we don't have enough info to really help
> > you.
>
> Sorry, this server is for a few (100+?) websites so it's running
> along site apache, php.  All connections to postgresql (except for
> the occaional psql console login) are done from php requests, using
> the same user (basically there are two users, the one php uses and
> postgres).  The bulk of the activity would be reads, but
> certainly inesrts/updates/deletes would be interspersed in there.
> Most of the activity is done via auto-commits, not many long
> transactions.

So, are there certain queries that are much slower than the others?
Run them from psql with explain analyze in front of them and post the
query and the output here.

> From your followup email:
>
> > ... you should use tools like vmstat, iostat and top to get an idea
> > of what your server is doing.
>
> # vmstat
>    procs                      memory    swap          io
> system         cpu
>  r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs
> us  sy  id
>  3  1  0    268  68332  39016 2201436   0   0     3     3    4
> 2   3   4   2

vmstat needs to be run for a while to give you useful numbers.   try:

vmstat 5

and let it run for a few minutes.  The first line won't count so much,
but after that you'll get more reasonable numbers.

> iostat is not found - will see if I can download it.  top typically
> shows postmaster as the top process with 10-15% of the CPU, followed
> by apache threads.

What OS are you on?

>  12:01pm  up 104 days, 12:05,  2 users,  load average: 9.75, 9.30,
> 7.70

That's pretty heavy load.  I notice there's no wait % listed for CPU,
so I assume it's not a late model Linux kernel or anything.

> 215 processes: 214 sleeping, 1 running, 0 zombie, 0 stopped
> CPU states:  0.1% user,  0.0% system,  0.0% nice,  0.4% idle
> Mem:  3617400K av, 3552784K used,  64616K free,      0K shrd,  37456K
> buff
> Swap: 2457928K av,    264K used, 2457664K free
> 2273664K cached
>
>   PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME
> COMMAND
> 31797 postgres  17   0 28836  28M  1784 S       0  8.5  0.7  10:15
> postmaster

Are the postmasters using most of the CPU?  OR the other processes?

> > What kind of drive subsystem do you have?  What kind of raid
> > controller?  etc...
>
> Gathering more information on this - Raid is a software
> RAID-1.  Some information:

OK, given that it's read mostly, it's likely not a problem that a
faster RAID controller would help.  Possibly more drives in a RAID 10
would help a little, but let's look at optimizing your query and
postmaster first.

Do you have the postmaster configured to log long running queries?
That's a good starting point.  also google pg_fouine (I think I spelt
it right) for analyzing your logs.

It's quite likely the issue here is one long running query that
chewing all your I/O or CPU and making everything else slow.  Once we
find that query things should get better and we can worry about
performance tuning in a more leisurely manner.

Re: Tuning Help - What did I do wrong?

From
"Kevin Grittner"
Date:
>>> On Thu, Oct 4, 2007 at 10:28 AM, in message
<20071004102804.3418912e@joplin.trutwins.homeip.net>, Josh Trutwin
<josh@trutwins.homeip.net> wrote:
> running postgres 8.1.4

> # cat /proc/meminfo
>         total:    used:    free:  shared: buffers:  cached:
> Mem:  3704217600 3592069120 112148480        0 39460864 2316271616

> shared_buffers = 16384 (was 1000)
> effective_cache_size = 10000 (was 1000)

It's kind of silly to tell PostgreSQL that its total cache space is 10000
pages when you've got more than that in shared buffers plus all that OS
cache space.  Try something around 285000 pages for effective_cache_size.

> stats_command_string = on (was off)
> stats_block_level = on (was off)
> stats_row_level = on (was off)

> After making these changes, the performance on the server actually
> worsened.   I slowly backed off on some of the paramaters but didn't
> seem to help.

Did you try turning off the collection of those additional statistics?
That isn't free.

You didn't get specific about what you saw in performance problems.  If
you are seeing occasional "freezes" of all queries, you are likely looking
at a known issue with "spikiness" of disk output.  For some this can be
corrected by using very aggressive background writer settings.  Some have
solved it by disabling OS write delays.  Some haven't found a solution and
are waiting for 8.3; there have been some serious changes made to attempt
to resolve this issue.

-Kevin




Re: Tuning Help - What did I do wrong?

From
Josh Trutwin
Date:
On Thu, 04 Oct 2007 14:03:07 -0500
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> It's kind of silly to tell PostgreSQL that its total cache space is
> 10000 pages when you've got more than that in shared buffers plus
> all that OS cache space.  Try something around 285000 pages for
> effective_cache_size.

Good point.

> > stats_command_string = on (was off)
> > stats_block_level = on (was off)
> > stats_row_level = on (was off)
>
> > After making these changes, the performance on the server actually
> > worsened.   I slowly backed off on some of the paramaters but
> > didn't seem to help.
>
> Did you try turning off the collection of those additional
> statistics? That isn't free.

I turned off all but row level since I decided to try turning
autovacuum on.

> You didn't get specific about what you saw in performance
> problems.  If you are seeing occasional "freezes" of all queries,
> you are likely looking at a known issue with "spikiness" of disk
> output.  For some this can be corrected by using very aggressive
> background writer settings.  Some have solved it by disabling OS
> write delays.  Some haven't found a solution and are waiting for
> 8.3; there have been some serious changes made to attempt to
> resolve this issue.

Thanks - I put some additional information in replies to Scott, but
mainly the performance of the web sites that are talking to postgres
is the problem - people calling in, etc.  I turned on slow query
logging to see if I can find if it's particular queries or something
else?

Josh