Thread: Finding bottleneck

Finding bottleneck

From: Kari Lavikka
Hello,

we recently upgraded our dual Xeon Dell to a brand new Sun v40z with 4
Opterons, 16GB of memory and a MegaRAID with enough disks. The OS is Debian
Sarge amd64 and PostgreSQL is 8.0.3. The database is something like 80GB
and our website performs about 600 selects and several updates/inserts a
second.

The v40z performs somewhat better than our old Dell, but mostly due to the
increased amount of memory. The problem is.. there seems to be plenty of
free CPU available and almost no IO-wait, but CPU-bound queries seem to
linger for some reason. The problem appears very clearly during checkpointing.
Queries accumulate, and when checkpointing is over there can be something
like 400 queries running but over 50% of the CPU is just idling.

procs -----------memory------------ ---swap-- -----io---- --system-- ----cpu----
  r  b   swpd   free   buff    cache   si   so    bi    bo   in    cs us sy id wa
  3  1      0 494008 159492 14107180    0    0   919  3164 3176 13031 29 12 52  8
  5  3      0 477508 159508 14118452    0    0  1071  4479 3474 13298 27 13 47 13
  0  0      0 463604 159532 14128832    0    0   922  2903 3352 12627 29 11 52  8
  3  1      0 442260 159616 14141668    0    0  1208  3153 3357 13163 28 12 52  9
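For reference, a rough way to count those running queries is pg_stat_activity
(just a sketch; it assumes stats_command_string is on, as in the config below,
and 8.0 column names):

-- Count backends that are actually executing something right now;
-- idle backends report '<IDLE>' in current_query.
SELECT count(*) AS running_queries
FROM pg_stat_activity
WHERE current_query <> '<IDLE>';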

An example of a lingering query (there's usually several of these or similar):

SELECT u.uid, u.nick, u.name, u.showname, i.status, i.stamp, i.image_id,
i.info, i.t_width, i.t_height FROM users u INNER JOIN image i ON i.uid =
u.uid INNER JOIN user_online uo ON u.uid = uo.uid WHERE u.city_id = 5 AND
i.status = 'd' AND u.status = 'a' ORDER BY city_id, upper(u.nick) LIMIT
(40 + 1) OFFSET 320

The tables involved contain no more than 4 million rows. They are constantly
accessed and should fit nicely into cache. But the database is just slow for
some unknown reason. Any ideas?

----------------->8 Relevant rows from postgresql.conf 8<-----------------

shared_buffers = 15000          # min 16, at least max_connections*2, 8KB each
work_mem = 1536                 # min 64, size in KB
maintenance_work_mem = 32768    # min 1024, size in KB

max_fsm_pages = 1000000         # min max_fsm_relations*16, 6 bytes each
max_fsm_relations = 5000        # min 100, ~50 bytes each

vacuum_cost_delay = 15          # 0-1000 milliseconds
vacuum_cost_limit = 120         # 0-10000 credits

bgwriter_percent = 2            # 0-100% of dirty buffers in each round

fsync = true                    # turns forced synchronization on or off
                                 # fsync, fdatasync, open_sync, or open_datasync
wal_buffers = 128               # min 4, 8KB each
commit_delay = 80000            # range 0-100000, in microseconds
commit_siblings = 10            # range 1-1000

checkpoint_segments = 200       # in logfile segments, min 1, 16MB each
checkpoint_timeout = 1800       # range 30-3600, in seconds

effective_cache_size = 1000000  # typically 8KB each
random_page_cost = 1.8          # units are one sequential page fetch cost

default_statistics_target = 150 # range 1-1000

stats_start_collector = true
stats_command_string = true
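A quick way to double-check what the running server actually uses is
pg_settings (just a sketch):

-- Effective values for the settings listed above.
SELECT name, setting
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'wal_buffers', 'commit_delay',
               'checkpoint_segments', 'checkpoint_timeout',
               'effective_cache_size');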

     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""

Re: Finding bottleneck

From: Gavin Sherry
Hi,

On Thu, 28 Jul 2005, Kari Lavikka wrote:

> ----------------->8 Relevant rows from postgresql.conf 8<-----------------
>
> shared_buffers = 15000          # min 16, at least max_connections*2, 8KB each
> work_mem = 1536                 # min 64, size in KB

As an aside, I'd increase work_mem -- but it doesn't sound like that is
your problem.

> maintenance_work_mem = 32768    # min 1024, size in KB
>
> max_fsm_pages = 1000000         # min max_fsm_relations*16, 6 bytes each
> max_fsm_relations = 5000        # min 100, ~50 bytes each
>
> vacuum_cost_delay = 15          # 0-1000 milliseconds
> vacuum_cost_limit = 120         # 0-10000 credits
>
> bgwriter_percent = 2            # 0-100% of dirty buffers in each round
>
> fsync = true                    # turns forced synchronization on or off
>                                  # fsync, fdatasync, open_sync, or open_datasync
> wal_buffers = 128               # min 4, 8KB each

Some benchmarking results out today suggest that wal_buffers = 1024 or
even 2048 could greatly assist you.

> commit_delay = 80000            # range 0-100000, in microseconds
> commit_siblings = 10            # range 1-1000

This may explain the fact that you've got backed up queries and idle CPU
-- I'm not certain though. What does disabling commit_delay do to your
situation?
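A per-session experiment is probably the quickest way to find out (a sketch;
as far as I recall commit_delay is an ordinary session-level GUC in 8.0):

-- Disable the group-commit delay for this session only and re-run a
-- representative transaction; postgresql.conf stays untouched.
SET commit_delay = 0;
SHOW commit_delay;   -- should now report 0 for this session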

Gavin

Re: Finding bottleneck

From: Claus Guttesen
> effective_cache_size = 1000000  # typically 8KB each

I have this setting on postgresql 7.4.8 on FreeBSD with 4 GB RAM:

effective_cache_size = 27462

So even though your machine runs Debian and has four times as much
RAM as mine, your effective_cache_size is 36 times larger. You could
try lowering this setting.
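For scale, effective_cache_size is counted in pages (typically 8KB, as the
comment in your config says), so the two settings work out roughly to:

-- Back-of-the-envelope conversion, assuming the default 8KB page size:
--   1000000 pages * 8KB  =  about 7.6 GB   (your 16GB machine)
--     27462 pages * 8KB  =  about 215 MB   (my 4GB machine)
SHOW effective_cache_size;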

regards
Claus

Re: Finding bottleneck

From: "Merlin Moncure"
Kari Lavikka wrote:
> shared_buffers = 15000
you can play around with this one but in my experience it doesn't make
much difference anymore (it used to).

> work_mem = 1536                 # min 64, size in KB
this seems low.  are you sure you are not getting sorts swapped to disk?
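an easy check is to raise it for a single session and re-run one of the slow
selects (just a sketch):

-- work_mem is a session-level setting (value in kB), so nothing in
-- postgresql.conf changes.  If the runtime improves a lot, the sort was
-- probably spilling to disk at 1.5 MB.
SET work_mem = 16384;   -- 16 MB for this session only
-- then EXPLAIN ANALYZE the slow SELECT from your first mail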

> fsync = true                    # turns forced synchronization on or off
does turning this off make a difference?  This would help narrow down
where the problem is.

> commit_delay = 80000            # range 0-100000, in microseconds
hm! how did you arrive at this number?  try setting to zero and
comparing.

> stats_start_collector = true
> stats_command_string = true
with a high query load you may want to consider turning this off.  On
win32, I've had some problems with the stats collector under high load
conditions.  Not on unix, but it's something to look at.  Just turn off
stats for a while and see if it helps.

good luck! your hardware should be more than adequate.

Merlin

Re: Finding bottleneck

From: "Luke Lonergan"
On 7/28/05 2:21 AM, "Kari Lavikka" <tuner@bdb.fi> wrote:

There's a new profiling tool called oprofile:

 http://oprofile.sourceforge.net/download/

that can be run without instrumenting the binaries beforehand.  To actually
find out what the code is doing during these stalls, oprofile can show you
in which routines the CPU is spending time when you start/stop the
profiling.

As an alternative to the "guess->change parameters->repeat" approach, this
is the most direct way to find the exact nature of the problem.

- Luke



Re: Finding bottleneck

From: Kari Lavikka
Hi!

Oprofile looks quite interesting. I'm not very familiar with postgresql
internals, but here's some report output:

CPU: AMD64 processors, speed 2190.23 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
13513390 16.0074  AtEOXact_CatCache
4492257   5.3213  StrategyGetBuffer
2279285   2.6999  AllocSetAlloc
2121509   2.5130  LWLockAcquire
2023574   2.3970  hash_seq_search
1971358   2.3352  nocachegetattr
1837168   2.1762  GetSnapshotData
1793693   2.1247  SearchCatCache
1777385   2.1054  hash_search
1460804   1.7304  ExecMakeFunctionResultNoSets
1360930   1.6121  _bt_compare
1344604   1.5928  yyparse
1318407   1.5617  LWLockRelease
1290814   1.5290  FunctionCall2
1137544   1.3475  ExecEvalVar
1102236   1.3057  hash_any
912677    1.0811  OpernameGetCandidates
877993    1.0400  ReadBufferInternal
783908    0.9286  TransactionIdPrecedes
772886    0.9155  MemoryContextAllocZeroAligned
679768    0.8052  StrategyBufferLookup
609339    0.7218  equal
600584    0.7114  PGSemaphoreLock

And btw, I tried to strace lingering queries under different loads. When
the number of concurrent queries increases, lseek and read syscalls stay
within quite constant limits but the number of semop calls quadruples.

Are there some buffer locking issues?

     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""

On Thu, 28 Jul 2005, Luke Lonergan wrote:

> On 7/28/05 2:21 AM, "Kari Lavikka" <tuner@bdb.fi> wrote:
>
> There's a new profiling tool called oprofile:
>
> http://oprofile.sourceforge.net/download/
>
> that can be run without instrumenting the binaries beforehand.  To actually
> find out what the code is doing during these stalls, oprofile can show you
> in which routines the CPU is spending time when you start/stop the
> profiling.
>
> As an alternative to the "guess->change parameters->repeat" approach, this
> is the most direct way to find the exact nature of the problem.
>
> - Luke
>
>
>

Re: Finding bottleneck

From: Tom Lane
Kari Lavikka <tuner@bdb.fi> writes:
> samples  %        symbol name
> 13513390 16.0074  AtEOXact_CatCache

That seems quite odd --- I'm not used to seeing that function at the top
of a profile.  What is the workload being profiled, exactly?

            regards, tom lane

Re: Finding bottleneck

From: "Merlin Moncure"
> Kari Lavikka <tuner@bdb.fi> writes:
> > samples  %        symbol name
> > 13513390 16.0074  AtEOXact_CatCache
>
> That seems quite odd --- I'm not used to seeing that function at the top
> of a profile.  What is the workload being profiled, exactly?

He is running a commit_delay of 80000.  Could that be playing a role?

Merlin

Re: Finding bottleneck

From: Tom Lane
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
>> Kari Lavikka <tuner@bdb.fi> writes:
>>> samples  %        symbol name
>>> 13513390 16.0074  AtEOXact_CatCache
>>
>> That seems quite odd --- I'm not used to seeing that function at the top
>> of a profile.  What is the workload being profiled, exactly?

> He is running a commit_delay of 80000.  Could that be playing a role?

It wouldn't cause AtEOXact_CatCache to suddenly get expensive.  (I have
little or no faith in the value of nonzero commit_delay, though.)

            regards, tom lane

Re: Finding bottleneck

From: Kari Lavikka
Actually I modified postgresql.conf a bit and there isn't a commit delay any
more. That didn't make a noticeable difference though..

The workload is generated by a website with about 1000 dynamic page views a
second. Finland's biggest site among youths, btw.

Anyway, there are about 70 tables and here are some of the most important:
              relname              |  reltuples
----------------------------------+-------------
  comment                          | 1.00723e+08
  comment_archive                  | 9.12764e+07
  channel_comment                  | 6.93912e+06
  image                            | 5.80314e+06
  admin_event                      |  5.1936e+06
  user_channel                     | 3.36877e+06
  users                            |      325929
  channel                          |      252267

Queries to "comment" table are mostly IO-bound but are performing quite
well. Here's an example:

(SELECT u.nick, c.comment, c.private, c.admin, c.visible, c.parsable,
c.uid_sender, to_char(c.stamp, 'DD.MM.YY HH24:MI') AS stamp, c.comment_id
FROM comment c INNER JOIN users u ON u.uid = c.uid_sender WHERE u.status =
'a' AND c.image_id = 15500900 AND c.uid_target = 780345 ORDER BY
uid_target DESC, image_id DESC, c.comment_id DESC) LIMIT 36

And explain analyze:
  Limit  (cost=0.00..6.81 rows=1 width=103) (actual time=0.263..17.522 rows=12 loops=1)
    ->  Nested Loop  (cost=0.00..6.81 rows=1 width=103) (actual time=0.261..17.509 rows=12 loops=1)
          ->  Index Scan Backward using comment_uid_target_image_id_comment_id_20050527 on "comment" c
                (cost=0.00..3.39 rows=1 width=92) (actual time=0.129..16.213 rows=12 loops=1)
                Index Cond: ((uid_target = 780345) AND (image_id = 15500900))
          ->  Index Scan using users_pkey on users u
                (cost=0.00..3.40 rows=1 width=15) (actual time=0.084..0.085 rows=1 loops=12)
                Index Cond: (u.uid = "outer".uid_sender)
                Filter: (status = 'a'::bpchar)
  Total runtime: 17.653 ms


We are having performance problems with some smaller tables and very
simple queries. For example:

SELECT u.uid, u.nick, extract(epoch from uc.stamp) AS stamp FROM
user_channel uc INNER JOIN users u USING (uid) WHERE channel_id = 281321
AND u.status = 'a' ORDER BY uc.channel_id, upper(uc.nick)

And explain analyze:
  Nested Loop  (cost=0.00..200.85 rows=35 width=48) (actual time=0.414..38.128 rows=656 loops=1)
    ->  Index Scan using user_channel_channel_id_nick on user_channel uc
          (cost=0.00..40.18 rows=47 width=27) (actual time=0.090..0.866 rows=667 loops=1)
          Index Cond: (channel_id = 281321)
    ->  Index Scan using users_pkey on users u
          (cost=0.00..3.40 rows=1 width=25) (actual time=0.048..0.051 rows=1 loops=667)
          Index Cond: ("outer".uid = u.uid)
          Filter: (status = 'a'::bpchar)
  Total runtime: 38.753 ms

Under heavy load these queries tend to take several minutes to execute
although there's plenty of free cpu available. There aren't any blocking
locks in pg_locks.
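(The lock check is basically just a look for ungranted entries in pg_locks,
more or less like this sketch:)

-- Any requested-but-not-granted lock shows up here, together with the
-- relation it is waiting on (if any).
SELECT l.pid, l.mode, l.granted, c.relname
FROM pg_locks l
LEFT JOIN pg_class c ON c.oid = l.relation
WHERE NOT l.granted;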

     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""

On Mon, 8 Aug 2005, Merlin Moncure wrote:

>> Kari Lavikka <tuner@bdb.fi> writes:
>>> samples  %        symbol name
>>> 13513390 16.0074  AtEOXact_CatCache
>>
>> That seems quite odd --- I'm not used to seeing that function at the top
>> of a profile.  What is the workload being profiled, exactly?
>
> He is running a commit_delay of 80000.  Could that be playing a role?
>
> Merlin
>

Re: Finding bottleneck

From: Tom Lane
Kari Lavikka <tuner@bdb.fi> writes:
> We are having performance problems with some smaller tables and very
> simple queries. For example:

> SELECT u.uid, u.nick, extract(epoch from uc.stamp) AS stamp FROM
> user_channel uc INNER JOIN users u USING (uid) WHERE channel_id = 281321
> AND u.status = 'a' ORDER BY uc.channel_id, upper(uc.nick)

> And explain analyze:
>   Nested Loop  (cost=0.00..200.85 rows=35 width=48) (actual time=0.414..38.128 rows=656 loops=1)
>     ->  Index Scan using user_channel_channel_id_nick on user_channel uc
>           (cost=0.00..40.18 rows=47 width=27) (actual time=0.090..0.866 rows=667 loops=1)
>           Index Cond: (channel_id = 281321)
>     ->  Index Scan using users_pkey on users u
>           (cost=0.00..3.40 rows=1 width=25) (actual time=0.048..0.051 rows=1 loops=667)
>           Index Cond: ("outer".uid = u.uid)
>           Filter: (status = 'a'::bpchar)
>   Total runtime: 38.753 ms

> Under heavy load these queries tend to take several minutes to execute
> although there's plenty of free cpu available.

What that sounds like to me is a machine with inadequate disk I/O bandwidth.
Your earlier comment that checkpoint drives the machine into the ground
fits right into that theory, too.  You said there is "almost no IO-wait"
but are you sure you are measuring that correctly?

Something else just struck me from your first post:

> Queries accumulate and when checkpointing is over, there can be
> something like 400 queries running but over 50% of cpu is just idling.

400 queries?  Are you launching 400 separate backends to do that?
Some sort of connection pooling seems like a good idea, if you don't
have it in place already.  If the system's effective behavior in the
face of heavy load is to start even more concurrent backends, that
could easily drive things into the ground.

            regards, tom lane

Re: Finding bottleneck

From: Kari Lavikka
On Mon, 8 Aug 2005, Tom Lane wrote:
> What that sounds like to me is a machine with inadequate disk I/O bandwidth.
> Your earlier comment that checkpoint drives the machine into the ground
> fits right into that theory, too.  You said there is "almost no IO-wait"
> but are you sure you are measuring that correctly?

Currently there's some iowait caused by "fragmentation" of the comment
table. Periodic clustering helps a lot.

The disk configuration looks something like this:
   sda: data (10 spindles, raid10)
   sdb: xlog & clog (2 spindles, raid1)
   sdc: os and other stuff

Usually iostat (2 second interval) says:
   avg-cpu:  %user   %nice    %sys %iowait   %idle
             32.38    0.00   12.88   11.62   43.12

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   sda             202.00      1720.00         0.00       3440          0
   sdb             152.50         4.00      2724.00          8       5448
   sdc               0.00         0.00         0.00          0          0

And during checkpoint:
   avg-cpu:  %user   %nice    %sys %iowait   %idle
             31.25    0.00   14.75   54.00    0.00

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   sda            3225.50      1562.00     35144.00       3124      70288
   sdb             104.50        10.00      2348.00         20       4696
   sdc               0.00         0.00         0.00          0          0

I don't think insufficient disk IO should cause those lingering queries,
because the dataset is rather small and continuously accessed. It should
fit into cache and stay there(?)
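(A rough way to check the sizes is pg_class; relpages is in 8KB blocks and
only as fresh as the last VACUUM/ANALYZE, so this is just a sketch:)

-- Approximate on-disk size of the biggest tables and indexes.
SELECT relname, relkind, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relkind IN ('r', 'i')
ORDER BY relpages DESC
LIMIT 15;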

> 400 queries?  Are you launching 400 separate backends to do that?

Well yes. That's the common problem with php and persistent connections.

> Some sort of connection pooling seems like a good idea, if you don't
> have it in place already.

pg_pool for example? I'm planning to give it a try.

>             regards, tom lane


     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""


Re: Finding bottleneck

From: Tom Lane
Kari Lavikka <tuner@bdb.fi> writes:
> Disk configurations looks something like this:
>    sda: data (10 spindles, raid10)
>    sdb: xlog & clog (2 spindles, raid1)
>    sdc: os and other stuff

That's definitely wrong.  Put clog on the data disk.  The entire point
of giving xlog its own spindle is that you don't ever want the disk
heads moving off the current xlog file.  I'm not sure how much this is
hurting you, given that clog is relatively low volume, but if you're
going to go to the trouble of putting xlog on a separate spindle then
it should be a completely dedicated spindle.

            regards, tom lane

Re: Finding bottleneck

From: Ron
I think I have a solution for you.

You have posted that you presently have these RAID volumes and behaviors:
   sda: data (10 spindles, raid10)
   sdb: xlog & clog (2 spindles, raid1)
   sdc: os and other stuff

Usually iostat (2 second interval) says:
   avg-cpu:  %user   %nice    %sys %iowait   %idle
             32.38    0.00   12.88   11.62   43.12

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   sda             202.00      1720.00         0.00       3440          0
   sdb             152.50         4.00      2724.00          8       5448
   sdc               0.00         0.00         0.00          0          0

And during checkpoint:
   avg-cpu:  %user   %nice    %sys %iowait   %idle
             31.25    0.00   14.75   54.00    0.00

   Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
   sda            3225.50      1562.00     35144.00       3124      70288
   sdb             104.50        10.00      2348.00         20       4696
   sdc               0.00         0.00         0.00          0          0


During checkpoints sda is becoming saturated, essentially halting all
other DB activity involving sda.  A lesser version of the problem is
probably occurring every time multiple entities on sda are being
accessed simultaneously, particularly simultaneous writes.

My Proposed Solution:
Put comment and its index on its own dedicated RAID volume.
Put comment_archive and its index on its own dedicated RAID volume.
Put the rest of the tables currently part of "data" on their own
dedicated RAID volume.
Put the rest of the indexes for the tables currently part of "data" on
their own dedicated RAID volume.
Put xlog on its own dedicated RAID volume.

The general idea here is to put any tables or indexes that tend to
require simultaneous access, particularly write access, on different
spindles.  Like all things, there's a point of diminishing returns
that is dependent on the HW used and the DB load.

If you must wring every last bit of IO out of the HD subsystem, a
more exact set of spindle assignments can be made by analyzing your
queries and then 1) making sure writes that tend to be simultaneous go
to different spindles, then (if you still need better IO) 2) making
sure reads that tend to be simultaneous go to different
spindles.  At some point, your controller will become the
bottleneck.  At some point beyond that, the IO channels on the
mainboard will become the bottleneck.

My suggestion should get you to within 80-90% of optimal if I've
understood the implications of your posts correctly.

The other suggestion I'd make is to bump your RAM from 16GB to 32GB
as soon as you can afford it and then tune your PostgreSQL parameters
to make best use of it.  The more RAM resident your DB, the better.

Hope this helps,
Ron Peacetree


===========Original Message Follows===========
From: Kari Lavikka <tuner@bdb.fi>
To: Merlin Moncure <merlin.moncure@rcsonline.com>
Subject: Re: Finding bottleneck
Date: Mon, 8 Aug 2005 19:19:09 +0300 (EETDST)

----------

Actually I modified postgresql.conf a bit and there isn't commit
delay any more. That didn't make noticeable difference though..

Workload is generated by a website with about 1000 dynamic page views
a second. Finland's biggest site among youths btw.


Anyway, there are about 70 tables and here's some of the most important:
              relname              |  reltuples
----------------------------------+-------------
  comment                          | 1.00723e+08
  comment_archive                  | 9.12764e+07
  channel_comment                  | 6.93912e+06
  image                            | 5.80314e+06
  admin_event                      |  5.1936e+06
  user_channel                     | 3.36877e+06
  users                            |      325929
  channel                          |      252267

Queries to "comment" table are mostly IO-bound but are performing
quite well. Here's an example:
(SELECT u.nick, c.comment, c.private, c.admin, c.visible, c.parsable,
c.uid_sender, to_char(c.stamp, 'DD.MM.YY HH24:MI') AS stamp,
c.comment_id FROM comment c INNER JOIN users u ON u.uid =
c.uid_sender WHERE u.status = 'a' AND c.image_id = 15500900 AND
c.uid_target = 780345 ORDER BY uid_target DESC, image_id DESC,
c.comment_id DESC) LIMIT 36


And explain analyze:
  Limit  (cost=0.00..6.81 rows=1 width=103) (actual
time=0.263..17.522 rows=12 loops=1)
    ->  Nested Loop  (cost=0.00..6.81 rows=1 width=103) (actual
time=0.261..17.509 rows=12 loops=1)
          ->  Index Scan Backward using
comment_uid_target_image_id_comment_id_20050527 on "comment"
c  (cost=0.00..3.39 rows=1 width=92) (actual time=0.129..16.213
rows=12 loops=1)
                Index Cond: ((uid_target = 780345) AND (image_id = 15500900))
          ->  Index Scan using users_pkey on users
u  (cost=0.00..3.40 rows=1 width=15) (actual time=0.084..0.085 rows=1 loops=12)
                Index Cond: (u.uid = "outer".uid_sender)
                Filter: (status = 'a'::bpchar)
  Total runtime: 17.653 ms


We are having performance problems with some smaller tables and very
simple queries. For example:
SELECT u.uid, u.nick, extract(epoch from uc.stamp) AS stamp FROM
user_channel uc INNER JOIN users u USING (uid) WHERE channel_id =
281321 AND u.status = 'a' ORDER BY uc.channel_id, upper(uc.nick)


And explain analyze:
  Nested Loop  (cost=0.00..200.85 rows=35 width=48) (actual
time=0.414..38.128 rows=656 loops=1)
    ->  Index Scan using user_channel_channel_id_nick on user_channel
uc  (cost=0.00..40.18 rows=47 width=27) (actual time=0.090..0.866
rows=667 loops=1)
          Index Cond: (channel_id = 281321)
    ->  Index Scan using users_pkey on users u  (cost=0.00..3.40
rows=1 width=25) (actual time=0.048..0.051 rows=1 loops=667)
          Index Cond: ("outer".uid = u.uid)
          Filter: (status = 'a'::bpchar)
  Total runtime: 38.753 ms

Under heavy load these queries tend to take several minutes to
execute although there's plenty of free cpu available. There aren't
any blocking locks in pg_locks.


     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""




Re: Finding bottleneck

From: Kari Lavikka
On Mon, 8 Aug 2005, Tom Lane wrote:
> What that sounds like to me is a machine with inadequate disk I/O bandwidth.
> Your earlier comment that checkpoint drives the machine into the ground
> fits right into that theory, too.  You said there is "almost no IO-wait"
> but are you sure you are measuring that correctly?

Reducing checkpoint_timeout to 600 seconds had a positive effect. Previous
value was 1800 seconds.

We have a spare disk array from the old server and I'm planning to use it
as a tablespace for the comment table (the 100M+ rows one) as Ron
suggested.
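Roughly what I have in mind is the usual 8.0 tablespace procedure (a sketch;
the tablespace name and mount point below are just placeholders):

-- The directory must already exist and be owned by the postgres user.
CREATE TABLESPACE comment_space LOCATION '/mnt/spare_array/pg_comment';

-- SET TABLESPACE rewrites the relation and takes an exclusive lock,
-- so this has to happen during a quiet period.
ALTER TABLE "comment" SET TABLESPACE comment_space;
ALTER INDEX comment_uid_target_image_id_comment_id_20050527
    SET TABLESPACE comment_space;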

>> Queries accumulate and when checkpointing is over, there can be
>> something like 400 queries running but over 50% of cpu is just idling.
>
> 400 queries?  Are you launching 400 separate backends to do that?
> Some sort of connection pooling seems like a good idea, if you don't
> have it in place already.  If the system's effective behavior in the
> face of heavy load is to start even more concurrent backends, that
> could easily drive things into the ground.

Ok, I implemented connection pooling using pgpool and it increased
performance a lot! We are now delivering about 1500 dynamic pages a second
without problems. Each of the eight single-cpu webservers is running a
pgpool instance with 20 connections.

However, those configuration changes didn't have a significant effect on the
oprofile results. AtEOXact_CatCache consumes even more cycles. This isn't
a problem right now but it may be in the future...

CPU: AMD64 processors, speed 2190.23 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
1147870  21.1602  AtEOXact_CatCache
187466    3.4558  hash_seq_search
174357    3.2142  AllocSetAlloc
170896    3.1504  nocachegetattr
131724    2.4282  ExecMakeFunctionResultNoSets
125292    2.3097  SearchCatCache
117264    2.1617  StrategyDirtyBufferList
105741    1.9493  hash_search
98245     1.8111  FunctionCall2
97878     1.8043  yyparse
90932     1.6763  LWLockAcquire
83555     1.5403  LWLockRelease
81045     1.4940  _bt_compare
... and so on ...

----->8 Significant rows from current postgresql.conf 8<-----

max_connections = 768           # unnecessarily large with connection pooling
shared_buffers = 15000
work_mem = 2048
maintenance_work_mem = 32768
max_fsm_pages = 1000000
max_fsm_relations = 5000
bgwriter_percent = 2
fsync = true
wal_buffers = 512
checkpoint_segments = 200       # less would probably be enough with the 600 sec timeout
checkpoint_timeout = 600
effective_cache_size = 500000
random_page_cost = 1.5
default_statistics_target = 150
stats_start_collector = true
stats_command_string = true


     |\__/|
     ( oo )    Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO(  )Ooo_______ _____ ___ _ _  _   _    _      _                  _
       ""


Re: Finding bottleneck

From: Tom Lane
Kari Lavikka <tuner@bdb.fi> writes:
> However, those configuration changes didn't have a significant effect on
> the oprofile results. AtEOXact_CatCache consumes even more cycles.

I believe I've fixed that for 8.1.

            regards, tom lane

Re: Finding bottleneck

From: "Merlin Moncure"
> Kari Lavikka <tuner@bdb.fi> writes:
> > However, those configuration changes didn't have a significant effect on
> > the oprofile results. AtEOXact_CatCache consumes even more cycles.
>
> I believe I've fixed that for 8.1.

Relative to 8.0, I am seeing a dramatic, almost miraculous reduction in
CPU load times in 8.1devel.  This is for ISAM style access patterns over
the parse/bind interface.  (IOW one record at a time, 90% read, 10%
write).

Relative to commercial dedicated ISAM storage engines, pg holds up very
well except in cpu load, but 8.1 is a huge step towards addressing that.

So far, except for one minor (and completely understandable) bitmap-related
issue, 8.1 has been a stellar performer.  Also great is the expansion of the
pg_locks view (which I didn't see mentioned in Bruce's TODO list, just FYI).

Merlin

Re: Finding bottleneck

From: Tom Lane
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> Relative to 8.0, I am seeing a dramatic, almost miraculous reduction in
> CPU load times in 8.1devel.  This is for ISAM style access patterns over
> the parse/bind interface.  (IOW one record at a time, 90% read, 10%
> write).

> Relative to commercial dedicated ISAM storage engines, pg holds up very
> well except in cpu load, but 8.1 is a huge step towards addressing that.

Cool --- we've done a fair amount of work on squeezing out internal
inefficiencies during this devel cycle, but it's always hard to predict
just how much anyone will notice in the real world.

Care to do some oprofile or gprof profiles to see where it's still bad?

            regards, tom lane

Re: Finding bottleneck

From: "Merlin Moncure"
> Cool --- we've done a fair amount of work on squeezing out internal
> inefficiencies during this devel cycle, but it's always hard to predict
> just how much anyone will notice in the real world.
>
> Care to do some oprofile or gprof profiles to see where it's still bad?
>

Since the release of 8.0, we are a strictly windows shop :).  I tried
building pg with the -pg flag and got errors in some of the satellite
libraries.  I think this is solvable, though; at some point I'll spend
more time on it.  Anyways, just so you know the #s that I'm seeing, I've
run several benchmarks of various programs that access pg via our ISAM
bridge.  The results are as consistent as they are good.  These tests
are on the same box using the same .conf on the same freshly loaded
data.  The disk doesn't play a major role in these tests.  All data
access is through ExecPrepared libpq C interface.  Benchmark is run from
a separate box on a LAN.

Bill of Materials Traversal ( ~ 62k records).

             ISAM*      pg 8.0     pg 8.1 devel   delta 8.0->8.1
running time 63 sec     90 secs    71 secs        21%
cpu load     17%        45%        32%            29%
loadsecs**   10.71      40.5       22.72          44%
recs/sec     984        688        873
recs/loadsec 5882       1530       2728

*ISAM is an anonymous commercial ISAM library in an optimized server
architecture (pg smokes the non-optimized flat file version).
**Loadsecs being seconds of CPU at 100% load (cpu load * running time, e.g. 45% of 90 secs = 40.5).


IOW cpu load drop is around 44%.  Amazing!

Merlin



Re: Finding bottleneck

From: "Merlin Moncure"
> Bill of Materials Traversal ( ~ 62k records).
>
>              ISAM*      pg 8.0     pg 8.1 devel   delta 8.0->8.1
> running time 63 sec     90 secs    71 secs        21%
> cpu load     17%        45%        32%            29%
> loadsecs**   10.71      40.5       22.72          44%
> recs/sec     984        688        873
> recs/loadsec 5882       1530       2728
>
> *ISAM is an anonymous commercial ISAM library in an optimized server
> architecture (pg smokes the non-optimized flat file version).
> **Loadsecs being seconds of CPU at 100% load.

One thing that might interest you is that the penalty in 8.1 for
stats_command_string=true in this type of access pattern is very high: I
was experimenting to see if the new cpu efficiency gave me enough of a
budget to start using this.  This more than doubled the cpu load to
around 70% with a runtime of 82 seconds.  This is actually worse than
8.0 :(.

This *might* be a somewhat win32 specific issue.  I've had issues with
the stats collector before.  Anyways, the feature is a frill so it's not
a big deal.

Merlin



Re: Finding bottleneck

From: Tom Lane
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> One thing that might interest you is that the penalty in 8.1 for
> stats_command_string=true in this type of access pattern is very high: I
> was experimenting to see if the new cpu efficiency gave me enough of a
> budget to start using this.  This more than doubled the cpu load to
> around 70% with a runtime of 82 seconds.  This is actually worse than
> 8.0 :(.

That seems quite peculiar; AFAICS the pgstat code shouldn't be any
slower than before.  At first I thought it might be because we'd
increased PGSTAT_ACTIVITY_SIZE, but actually that happened before
8.0 release, so it shouldn't be a factor in this comparison.

Can anyone else confirm a larger penalty for stats_command_string in
HEAD than in 8.0?  A self-contained test case would be nice too.

            regards, tom lane

Re: Finding bottleneck

From: "Merlin Moncure"
> That seems quite peculiar; AFAICS the pgstat code shouldn't be any
> slower than before.  At first I thought it might be because we'd
> increased PGSTAT_ACTIVITY_SIZE, but actually that happened before
> 8.0 release, so it shouldn't be a factor in this comparison.

Just FYI the last time I looked at stats was in the 8.0 beta period.

> Can anyone else confirm a larger penalty for stats_command_string in
> HEAD than in 8.0?  A self-contained test case would be nice too.

looking into it.

Merlin