Re: BUG #9635: Wal sender process is using 100% CPU - Mailing list pgsql-bugs

From Jamie Koceniak
Subject Re: BUG #9635: Wal sender process is using 100% CPU
Date
Msg-id 84CE67AB46ACDA41B18094177DD0853D212A5A@ORD2MBX04C.mex05.mlsrvr.com
Whole thread Raw
In response to Re: BUG #9635: Wal sender process is using 100% CPU  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: BUG #9635: Wal sender process is using 100% CPU  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-bugs
Hi Heikki,

Attached is a trace route for one of the pids that was running at 100% cpu.
Note: when the wal_senders hit 100% cpu, our slave servers tend to lag the master for up to around 3-4 minutes until
thewal_sender cpu utilization drops. This is very problematic for us since we are trying to use the slaves for reads in
ourapplication.
 

Here is perf top results:
39.49%  libz.so.1.2.7           [.] 0x000000000000301e
 32.97%  postgres                [.] 0x00000000000cf150
  6.50%  postgres                [.] nocachegetattr
  2.63%  postgres                [.] btint4cmp
  1.64%  postgres                [.] _start
  1.40%  [unknown]               [.] 0x00007f9dfca6a34b
  1.06%  libc-2.13.so            [.] 0x000000000012934b
  0.94%  postgres                [.] slot_getattr
  0.85%  libcrypto.so.1.0.0      [.] 0x000000000009f48f
  0.69%  postgres                [.] ExecProject
  0.67%  libz.so.1.2.7           [.] adler32
  0.66%  postgres                [.] GetMemoryChunkSpace
  0.58%  [kernel]                [k] copy_user_generic_string
  0.45%  postgres                [.] heap_fill_tuple
  0.38%  postgres                [.] execTuplesMatch
  0.27%  postgres                [.] FunctionCall2Coll
  0.27%  postgres                [.] ExecProcNode
  0.25%  postgres                [.] heap_tuple_untoast_attr
  0.22%  postgres                [.] MemoryContextAlloc
  0.21%  postgres                [.] heap_form_minimal_tuple
  0.21%  postgres                [.] pfree
  0.19%  postgres                [.] heap_compute_data_size

 Here is some data using perf record and perf report:

+  33.93%        193019       postgres  postgres                  [.] 0x0000000000369e27
+  30.06%        173825       postgres  libz.so.1.2.7             [.] 0x00000000000036c0
+  10.21%         60618       postgres  [unknown]                 [.] 0x00007f9dfc2fd529
+   4.81%         27275       postgres  postgres                  [.] nocachegetattr
+   2.55%         14443       postgres  postgres                  [.] btint4cmp
+   1.22%          6916       postgres  postgres                  [.] ExecProject
+   1.20%          6837       postgres  postgres                  [.] _start
+   1.10%          6495       postgres  libc-2.13.so              [.] 0x000000000012b8e9
+   0.84%          4760       postgres  postgres                  [.] heap_fill_tuple
+   0.74%          4265       postgres  [kernel.kallsyms]         [k] copy_user_generic_string
+   0.71%          4001       postgres  postgres                  [.] GetMemoryChunkSpace
+   0.66%          3881       postgres  libcrypto.so.1.0.0        [.] 0x0000000000081ec7
+   0.63%          3673       postgres  libz.so.1.2.7             [.] adler32
+   0.43%          2448       postgres  postgres                  [.] slot_getattr
+   0.40%          2277       postgres  postgres                  [.] ExecProcNode
+   0.40%          2250       postgres  postgres                  [.] heap_form_minimal_tuple
+   0.37%          2078       postgres  postgres                  [.] heap_compute_data_size
+   0.31%          1739       postgres  postgres                  [.] ExecScan
+   0.27%          1779       postgres  [kernel.kallsyms]         [k] page_fault
+   0.24%          1374       postgres  postgres                  [.] LogicalTapeWrite
+   0.24%          4206        swapper  [kernel.kallsyms]         [k] intel_idle
+   0.21%          1206       postgres  postgres                  [.] slot_getsomeattrs

Also, here are our wal settings on the master.
Do these settings appear ok?

wal_block_size                  | 8192                                     | 
 wal_buffers                     | 16MB                                     | 
 wal_keep_segments               | 5000                                     | 
 wal_level                       | hot_standby                              | 
 wal_receiver_status_interval    | 10s                                      | 
 wal_segment_size                | 16MB                                     | 
 wal_sender_delay                | 1s                                       | 
 wal_sync_method                 | fdatasync                                | 
 wal_writer_delay                | 200ms                                    |

Thanks!

-----Original Message-----
From: Heikki Linnakangas [mailto:hlinnakangas@vmware.com] 
Sent: Tuesday, May 06, 2014 4:01 AM
To: Jamie Koceniak
Cc: pgsql-bugs@postgresql.org
Subject: Re: [BUGS] BUG #9635: Wal sender process is using 100% CPU

On 05/05/2014 10:37 PM, Jamie Koceniak wrote:
> Hi Heikki,
>
> I performed an strace on one of the pids that was consuming 100% of the cpu.
> This is just a portion of the log.
> I also trimmed out some data provided in the quotes in the read and sends, not sure if you need it.
> A majority of the calls in the trace log were recvfrom() and sendto().
> EAGAIN (Resource temporarily unavailable) came up time and time again.
>
> # strace log during 100% cpu utilization on the master

Hmm, that trace looks normal to me. It doesn't look like it's busy-looping, as I would expect if it's using 100% of the
CPU.

Can you post the full log, or at least a larger chunk? I don't need the data in the quotes from read() and sends() that
youremoved, but I would like to see the trace from a longer period of time to see what the pattern is.
 

- Heikki

Attachment

pgsql-bugs by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Problem with PostgreSQL 9.2.7 and make check on AIX 7.1
Next
From: Heikki Linnakangas
Date:
Subject: Re: BUG #9635: Wal sender process is using 100% CPU