Re: BUG #9635: Wal sender process is using 100% CPU - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #9635: Wal sender process is using 100% CPU
Date
Msg-id 53691CB3.3020908@vmware.com
Whole thread Raw
In response to Re: BUG #9635: Wal sender process is using 100% CPU  (Jamie Koceniak <jkoceniak@mediamath.com>)
Responses Re: BUG #9635: Wal sender process is using 100% CPU  (Jamie Koceniak <jkoceniak@mediamath.com>)
List pgsql-bugs
On 05/06/2014 07:44 PM, Jamie Koceniak wrote:
> Hi Heikki,
>
> Attached is a trace route for one of the pids that was running at 100% cpu.
> Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until
thewal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in
ourapplication. 
>
> Here is perf top results:
> 39.49%  libz.so.1.2.7           [.] 0x000000000000301e
>   32.97%  postgres                [.] 0x00000000000cf150
>    6.50%  postgres                [.] nocachegetattr
>    2.63%  postgres                [.] btint4cmp
>    1.64%  postgres                [.] _start
>    1.40%  [unknown]               [.] 0x00007f9dfca6a34b
>    1.06%  libc-2.13.so            [.] 0x000000000012934b
>    0.94%  postgres                [.] slot_getattr
>    0.85%  libcrypto.so.1.0.0      [.] 0x000000000009f48f
>    0.69%  postgres                [.] ExecProject
>    0.67%  libz.so.1.2.7           [.] adler32
>    0.66%  postgres                [.] GetMemoryChunkSpace
>    0.58%  [kernel]                [k] copy_user_generic_string
>    0.45%  postgres                [.] heap_fill_tuple
>    0.38%  postgres                [.] execTuplesMatch
>    0.27%  postgres                [.] FunctionCall2Coll
>    0.27%  postgres                [.] ExecProcNode
>    0.25%  postgres                [.] heap_tuple_untoast_attr
>    0.22%  postgres                [.] MemoryContextAlloc
>    0.21%  postgres                [.] heap_form_minimal_tuple
>    0.21%  postgres                [.] pfree
>    0.19%  postgres                [.] heap_compute_data_size

Hmm, these results seem to include all processes, not just the WAL sender

>   Here is some data using perf record and perf report:
>
> +  33.93%        193019       postgres  postgres                  [.] 0x0000000000369e27
> +  30.06%        173825       postgres  libz.so.1.2.7             [.] 0x00000000000036c0
> +  10.21%         60618       postgres  [unknown]                 [.] 0x00007f9dfc2fd529
> +   4.81%         27275       postgres  postgres                  [.] nocachegetattr
> +   2.55%         14443       postgres  postgres                  [.] btint4cmp
> +   1.22%          6916       postgres  postgres                  [.] ExecProject
> +   1.20%          6837       postgres  postgres                  [.] _start
> +   1.10%          6495       postgres  libc-2.13.so              [.] 0x000000000012b8e9
> +   0.84%          4760       postgres  postgres                  [.] heap_fill_tuple
> +   0.74%          4265       postgres  [kernel.kallsyms]         [k] copy_user_generic_string
> +   0.71%          4001       postgres  postgres                  [.] GetMemoryChunkSpace
> +   0.66%          3881       postgres  libcrypto.so.1.0.0        [.] 0x0000000000081ec7
> +   0.63%          3673       postgres  libz.so.1.2.7             [.] adler32
> +   0.43%          2448       postgres  postgres                  [.] slot_getattr
> +   0.40%          2277       postgres  postgres                  [.] ExecProcNode
> +   0.40%          2250       postgres  postgres                  [.] heap_form_minimal_tuple
> +   0.37%          2078       postgres  postgres                  [.] heap_compute_data_size
> +   0.31%          1739       postgres  postgres                  [.] ExecScan
> +   0.27%          1779       postgres  [kernel.kallsyms]         [k] page_fault
> +   0.24%          1374       postgres  postgres                  [.] LogicalTapeWrite
> +   0.24%          4206        swapper  [kernel.kallsyms]         [k] intel_idle
> +   0.21%          1206       postgres  postgres                  [.] slot_getsomeattrs

Same here.

> Also, here are our wal settings on the master.
> Do these settings appear ok?
>
> wal_block_size                  | 8192                                     |
>   wal_buffers                     | 16MB                                     |
>   wal_keep_segments               | 5000                                     |
>   wal_level                       | hot_standby                              |
>   wal_receiver_status_interval    | 10s                                      |
>   wal_segment_size                | 16MB                                     |
>   wal_sender_delay                | 1s                                       |
>   wal_sync_method                 | fdatasync                                |
>   wal_writer_delay                | 200ms                                    |

Looks reasonable to me.

So, the strace output repeats this, with small variations in the number
of bytes written in each sendto() call:

recvfrom(10, 0x7f9dfec1f753, 5, 0, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
read(7, ..., 131072) = 131072
sendto(10, ..., 3690, 0, NULL, 0) = 3690
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3653, 0, NULL, 0) = 3653
sendto(10, ..., 3637, 0, NULL, 0) = 3637
sendto(10, ..., 3621, 0, NULL, 0) = 3621
sendto(10, ..., 3669, 0, NULL, 0) = 3669
sendto(10, ..., 3605, 0, NULL, 0) = 3605
sendto(10, ..., 53, 0, NULL, 0) = 53
getppid()                               = 5474

The read() reads 128kB of WAL from the file in pg_xlog, and it is then
sent to the client. Now, it's pretty surprising that the number of bytes
written with sendto() doesn't match the number of bytes read from the
file. Unless, you're using SSL with compression. That would also explain
the high amount of CPU time spent in libz in the perf output.

Try disabling SSL compression, with sslcompression=0 option in
primary_conninfo, or disable SSL altogether if the network is
trustworthy enough for that.

- Heikki

pgsql-bugs by date:

Previous
From: Jamie Koceniak
Date:
Subject: Re: BUG #9635: Wal sender process is using 100% CPU
Next
From: Jeff Davis
Date:
Subject: regression failure on master with --disable-integer-datetimes