Thread: Disconnects hanging server

Disconnects hanging server

From: Brian Wipf

We have a dual 3.0 GHz Intel Dual-core Xserve, running Mac OS X 10.5.1
Leopard Server and PostgreSQL 8.2.5. When we disconnect several
clients at a time (30+) in production, the CPU goes through the roof
and the server will hang for many seconds where it is completely non-
responsive. It seems the busier the server is, the longer the machine
will hang.

With an identical postgresql.conf file in the identical production
environment, our Linux 2.6.22 box running PG 8.2.5 has no problems
when disconnecting multiple clients. Also, our prior G5 Xserve running
Mac OS X Server 10.4.9 and PG 8.2.4 had no issues disconnecting
multiple clients.

Using pgbench, I have been able to duplicate the issue on another
Intel Xserve running 10.5.1 on a fresh install of PG 8.2.5. PG was
compiled 64-bit using CFLAGS='-arch x86_64'. The only configure option
was --enable-thread-safety.

The only modifications I have made to the postgresql.conf file are as
follows:
max_connections = 175
shared_buffers = 3GB # The max supported under 10.5.1, after setting shmall and shmmax accordingly
checkpoint_segments = 64
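
As a rough sketch of the arithmetic behind the shared memory limits mentioned in the shared_buffers comment: on Mac OS X these are the kern.sysv.shmmax (bytes) and kern.sysv.shmall (4 kB pages) sysctls. The values actually used on this server are not given in the post. The helper name and the 64 MB headroom below are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # bytes; kern.sysv.shmall is counted in 4 kB pages on Mac OS X


def sysv_shm_settings(shared_buffers_bytes, headroom_bytes):
    """Return (shmmax, shmall) large enough for one segment of the given size.

    headroom_bytes is an assumed allowance for PostgreSQL's other shared
    structures; the real overhead depends on the configuration.
    """
    needed = shared_buffers_bytes + headroom_bytes
    # Round up to whole pages so shmmax is a multiple of PAGE_SIZE.
    pages = -(-needed // PAGE_SIZE)
    return pages * PAGE_SIZE, pages


if __name__ == "__main__":
    GB = 1024 ** 3
    MB = 1024 ** 2
    # 64 MB of headroom is illustrative, not a value from the post.
    shmmax, shmall = sysv_shm_settings(3 * GB, 64 * MB)
    print(f"kern.sysv.shmmax={shmmax}")
    print(f"kern.sysv.shmall={shmall}")
```

The two printed values are in the form the sysctls expect; whether 10.5.1 accepts them is a separate question this sketch does not answer.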

I used a scale factor of 150 when initializing a database for pgbench.
If I run `pgbench -c 150 -t 5000` and kill it (Ctrl-C) shortly after
launching it, but after it completes its vacuum, there is a very minor
and brief increase in CPU usage (which I didn't notice at all btw on
the Linux box). If I let pgbench run for approximately 10 minutes and
then Ctrl-C it, the CPU will max out and the machine will hang.
iostat stops reporting and top stops refreshing. This lasts for a
couple seconds, then top and iostat resume. Here is what iostat showed
when I killed pgbench after approximately 10 minutes:

postgres$ iostat -n 5 1
...
       disk0             disk1             disk2             disk3          cpu         load average
    KB/t tps  MB/s    KB/t tps  MB/s    KB/t tps  MB/s    KB/t tps  MB/s  us  sy  id     1m    5m   15m
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00   10.20 732  7.30   2   4  93   1.07  2.22  2.30
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.96 766  6.71   1   2  98   1.07  2.22  2.30
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.33 755  6.88   1   2  97   1.07  2.22  2.30
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.86 777  6.73   0   2  97   1.07  2.22  2.30
--> I hit ctrl-c to kill pgbench here
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.17 766  6.86   1  43  55   1.07  2.22  2.30
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.03 770  6.79   0  79  20   1.71  2.33  2.34
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.04  77  0.68   1  38  61   1.71  2.33  2.34
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.95 273  2.39   0  80  19   1.71  2.33  2.34
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00   15.03 240  3.53   1  99   1   1.71  2.33  2.34
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.69 365  3.45   1  99   0   4.05  2.80  2.51
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00   0 100   0   4.05  2.80  2.51
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00   0 100   0   4.05  2.80  2.51
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.50  16  0.13   0 100   0   8.85  3.82  2.87
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00   10.18  17  0.17   0 100   0   8.85  3.82  2.87
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.00  75  0.66   0 100   0   8.85  3.82  2.87
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.50  16  0.13   0 100   0  12.39  4.64  3.16
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.10  68  0.60   0 100   0  14.20  5.14  3.35
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.75  75  0.64   0 100   0  14.20  5.14  3.35
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.83 249  2.14   0 100   0  14.20  5.14  3.35
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.57  14  0.12   0 100   0  15.46  5.55  3.50
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.33 265  2.41   1  99   0  15.46  5.55  3.50
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.15 361  3.22   0 100   0  15.46  5.55  3.50
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.33  40  0.32   1  99   0  15.46  5.55  3.50
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.93 843  7.36   0 100   0  17.43  6.12  3.72
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.84 560  4.84   0 100   0  17.43  6.12  3.72
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.06 428  3.79   1  99   0  17.43  6.12  3.72
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.00  12  0.10   0 100   0  17.43  6.12  3.72
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.92 243  2.12   0  91   9  17.43  6.12  3.72
--> unit recovered here:
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00   13.11 628  8.03   0   2  97  16.03  6.02  3.69
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.00 517  4.04   0   2  97  16.03  6.02  3.69
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    8.88 511  4.43   0   2  97  16.03  6.02  3.69
    0.00   0  0.00    0.00   0  0.00    0.00   0  0.00    9.02 503  4.43   0   2  98  16.03  6.02  3.69
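
To put a number on the stall, the system-CPU ("sy") column from the samples above can be scanned for the longest run of near-saturated readings. This is only a sketch: the 90% threshold is an arbitrary choice and the function name is made up for this example.

```python
# "sy" (system CPU %) values transcribed from the iostat output above,
# starting with the first sample after pgbench was interrupted.
SY_AFTER_INTERRUPT = [
    43, 79, 38, 80, 99, 99, 100, 100, 100, 100, 100, 100,
    100, 100, 100, 100, 99, 100, 99, 100, 100, 99, 100, 91,
    2, 2, 2, 2,  # samples after the "unit recovered" marker
]


def longest_saturated_run(samples, threshold=90):
    """Length of the longest run of consecutive samples >= threshold."""
    best = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        best = max(best, current)
    return best


if __name__ == "__main__":
    run = longest_saturated_run(SY_AFTER_INTERRUPT)
    print(f"{run} consecutive samples with sy >= 90%")  # prints 20 for this data
```

If the trailing 1 in the iostat command is a one-second sampling interval (an assumption), that run would correspond to roughly 20 seconds of saturated system time, not counting any samples iostat failed to emit while the machine was hung.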

I installed PG 8.3 beta 3 to see if the behavior would be any
different. The CPU usage in general seemed higher in PG 8.3 beta 3,
and I still get the spike when disconnecting multiple clients. I tried
with default settings on 8.2.5 (except for a higher max_connections),
as well as with only a higher shared_buffers, and also with only a
higher checkpoint_segments. The CPU would still spike to 100% in all of
these cases, but it didn't seem to stay there as long as when
checkpoint_segments and shared_buffers are both high. I suppose the
difference may simply come down to when I killed pgbench in each run.

I'm not sure if this is a bug with PostgreSQL or OS X 10.5.1. Any
suggestions on what I can do to narrow down the problem further would
be greatly appreciated.

Brian Wipf
ClickSpace Interactive Inc.
<brian@clickspace.com>


Re: Disconnects hanging server

From: "A.M."

On Dec 3, 2007, at 4:16 PM, Brian Wipf wrote:

> We have a dual 3.0 GHz Intel Dual-core Xserve, running Mac OS X
> 10.5.1 Leopard Server and PostgreSQL 8.2.5. When we disconnect
> several clients at a time (30+) in production, the CPU goes through
> the roof and the server will hang for many seconds where it is
> completely non-responsive. It seems the busier the server is, the
> longer the machine will hang.

You should run Shark or Instruments to determine where the system is
getting hung up. You will likely need to install developer tools. If
you need help reading the profilers' output, please join up on an
Apple list.

In my profiling of PostgreSQL under 10.4 with PostgreSQL 8.1, I found
disappointing results with bottlenecks in the mutex-locked stdio. I
suspect that the results in 10.5 may be drastically different.

Cheers,
M

Re: Disconnects hanging server

From: Brian Wipf

On 3-Dec-07, at 3:51 PM, A.M. wrote:
> On Dec 3, 2007, at 4:16 PM, Brian Wipf wrote:
>
>> We have a dual 3.0 GHz Intel Dual-core Xserve, running Mac OS X
>> 10.5.1 Leopard Server and PostgreSQL 8.2.5. When we disconnect
>> several clients at a time (30+) in production, the CPU goes through
>> the roof and the server will hang for many seconds where it is
>> completely non-responsive. It seems the busier the server is, the
>> longer the machine will hang.
>
> You should run Shark or Instruments to determine where the system is
> getting hung up. You will likely need to install developer tools. If
> you need help reading the profilers' output, please join up on an
> Apple list.

As per A.M.'s suggestion, I have run a time profile in Shark to get
some idea of what's going on when the server hangs when disconnecting
clients.

Nearly 100% of the CPU is going into pmap_remove_range. The stack
trace for pmap_remove_range, viewable within Shark, is:
-> pmap_remove_range
--> pmap_remove
---> vm_map_simplify
----> vm_map_remove
-----> task_terminate_internal
------> exit1
-------> exit
--------> unix_syscall64
---------> lo64_unix_scall

The call taking up the next highest amount of CPU, at 0.1%, is
AtProcExit_Buffers. And its stack trace:
-> AtProcExit_Buffers
--> shmem_exit
---> proc_exit
----> PostgresMain
-----> BackendRun
------> BackendStartup
-------> ServerLoop
--------> PostmasterMain
---------> main
----------> start

Brian Wipf
<brian@clickspace.com>
ClickSpace Interactive Inc.


Re: Disconnects hanging server

From: Tom Lane

Brian Wipf <brian@clickspace.com> writes:
> Nearly 100% of the CPU is going into pmap_remove_range. The stack
> trace for pmap_remove_range, viewable within Shark, is:
> -> pmap_remove_range
> --> pmap_remove
> ---> vm_map_simplify
> ----> vm_map_remove
> -----> task_terminate_internal
> ------> exit1
> -------> exit
> --------> unix_syscall64
> ---------> lo64_unix_scall

In case it's not obvious, this is a kernel performance bug, which
you should report to Apple.

In the meantime you might want to think about backing off your
shared_buffers setting.  I would suppose that the performance bug
is being triggered by a very large shared memory segment (you
said 3GB, right?).

            regards, tom lane
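
A back-of-the-envelope sketch of why a very large shared memory segment could make process exit expensive: each exiting backend has its mappings of the segment torn down, which is the work the pmap_remove_range trace points at. The figures assume 4 kB pages and, as a worst case, that every backend has touched every page of the segment. Neither assumption is confirmed in the thread, and the function name is hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def mappings_to_remove(segment_bytes, backends, touched_fraction=1.0):
    """Estimate page mappings torn down when `backends` processes exit.

    touched_fraction is the assumed share of the segment each backend
    has actually mapped in; 1.0 is the worst case.
    """
    pages_per_backend = int(segment_bytes // PAGE_SIZE * touched_fraction)
    return pages_per_backend * backends


if __name__ == "__main__":
    GB = 1024 ** 3
    for label, size in (("3 GB", 3 * GB), ("1 GB", 1 * GB)):
        total = mappings_to_remove(size, backends=150)
        print(f"{label} segment, 150 backends: {total:,} mappings")
```

Under these assumptions, 150 backends sharing a 3 GB segment account for about 118 million mappings, against about 39 million for 1 GB. If backends map in more of the segment the longer they run, that might also fit the report that the hang is worse after pgbench has been running for 10 minutes, though the thread does not establish this.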