why would postgres be throttling a streaming replication slot's sending? - Mailing list pgsql-general

From: Chris Withers
Subject: why would postgres be throttling a streaming replication slot's sending?
Msg-id: 1a8891b4-10ca-6545-79bc-95becb8c17d4@withers.org
In response to: Re: finding out what's generating WALs (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
Responses: Re: why would postgres be throttling a streaming replication slot's sending?
List: pgsql-general
On 11/12/2018 14:48, Achilleas Mantzios wrote:
> On 11/12/18 4:00 μ.μ., Chris Withers wrote:
>>
>> I'm looking after a multi-tenant PG 9.4 cluster, and we've started
>> getting alerts for the number of WALs on the server.
>> It'd be great to understand what's generating all that WAL and what's
>> likely to be causing any problems.
>
> Regarding your WALs in pg_wal, a good threshold could be anything more
> than e.g. a 10% increase over wal_keep_segments, with a trend to go up.
> If this number goes up, chances are something bad is happening.

wal_keep_segments is 0 here, so I went hunting in pg_replication_slots
and found that it's the barman slot that's behind on the WAL.

strace'ing on the barman receiving side shows it waiting on the socket,
so no problem on that side.

On the sending side it's a little confusing: the server is basically
idle, with no CPU or disk activity, and yet strace'ing the sending
process attached to the other end of the socket shows time being spent
waiting on a poll which, while it includes the socket being sent to,
also includes a bunch of pipes.

I've attached a chunk of the output below in the hope that someone on
this list can offer an explanation as to what might cause the WAL to be
trickling over this port rather than catching up as fast as it can.

cheers, Chris

strace output for the sending process servicing the barman replication
slot (streaming_barman):

--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
write(4, "\0", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
read(6, 0x7fffffffdcd7, 1) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
read(5, "\3\1\0\0\7\326\355\2\343\0\0\0@\n\0\0\0\17\1\230\313\4\0\0)\347\372l\0\0\0\0"..., 4272) = 4272
sendto(10, "d\0\0\20\315w\0\0\4\313\230\1\17P\0\0\4\313\230\1 \0\0\2\37\321\343\345\306\20\3\1"..., 4302, 0, NULL, 0) = 4302
read(3, 0x7fffffffdc90, 16) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6, events=POLLIN}], 3, 26076) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---
write(4, "\0", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
read(3, "\0", 16) = 1
read(6, 0x7fffffffdcd7, 1) = -1 EAGAIN (Resource temporarily unavailable)
recvfrom(10, 0x7fffffffdcaf, 1, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
read(5, "~\320\5\0\3\0\0\0\0 \1\230\313\4\0\0\3\0\0\0\0\0\0\0\323\316\314\0\0\0\0\0"..., 8192) = 8192
sendto(10, "d\0\0 \35w\0\0\4\313\230\1 \0\0\0\4\313\230\1@\0\0\2\37\321\343\361\376\226~\320"..., 8222, 0, NULL, 0) = 8222
read(3, 0x7fffffffdc90, 16) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=10, events=POLLIN}, {fd=3, events=POLLIN}, {fd=6, events=POLLIN}], 3, 25275) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=3577, si_uid=26} ---

Interesting numbers:

- process 3577 is the wal writer process
- fd 10 is the socket, 3 and 6 are pipes
- time is visibly spent sitting on that poll
- the network is only seeing 2 Mbit/sec, which is nothing

cheers, Chris
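For reference, the "hunting in pg_replication_slots" step above boils down to checking how far each slot's restart_lsn trails the current WAL insert position. A minimal sketch of such a query, assuming the 9.4 catalog and function names (pg_xlog_* rather than the pg_wal_* names used from PG 10 onwards):

    -- How much WAL each replication slot is holding back, relative to
    -- the current WAL insert location (PG 9.4 naming).
    SELECT slot_name,
           active,
           restart_lsn,
           pg_size_pretty(
             pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
           ) AS retained_wal
    FROM   pg_replication_slots
    ORDER  BY pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) DESC;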
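The progress of the walsender feeding barman can be inspected from pg_stat_replication in the same way; a sketch, again assuming 9.4 column names (the barman receiver appears under whatever application_name it connects with):

    -- Send/flush/replay progress of each connected walsender.
    SELECT application_name,
           state,
           sent_location,
           flush_location,
           replay_location,
           pg_size_pretty(
             pg_xlog_location_diff(pg_current_xlog_location(), sent_location)
           ) AS send_lag
    FROM   pg_stat_replication;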
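And the alert condition itself, the raw count of WAL segment files on disk, can be checked from SQL as well; a sketch assuming superuser access and the 9.4 directory layout (pg_xlog, renamed to pg_wal in PG 10):

    -- Count WAL segment files currently in the data directory.
    SELECT count(*) AS wal_segments
    FROM   pg_ls_dir('pg_xlog') AS f(name)
    WHERE  f.name ~ '^[0-9A-F]{24}$';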