Thread: Odd OS, SAN or related throughput issue affecting large streamer

Odd OS, SAN or related throughput issue affecting large streamer

From
Jerry Sievers
Date:
Greetings!

I do not know if $issue has anything to do w/Pg directly but would be
very grateful for any insights...

We've got ~20 servers on a beefy physical host that's using a
fibre-channel storage array backend.

Runtime and OS software was updated recently, and about 2 days ago our
big monster system began lagging in replication.  It is generally able
to stream WALs from the primary, but can no longer keep up applying them;
the apply rate is orders of magnitude below what it was prior.

Full reboot of the host restores adequate throughput for several hours,
after which the backlogging resumes.  We have witnessed the reboot
as temp fix and then repeated falloff twice now on consecutive days.

System of interest is a churny, ~50TB warehouse w/4 tablespaces.  I'm
unclear on whether all or just some of them are sluggish on writes.

It is "replay" lag, not lag in streaming that is evident.
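
For readers who want to put a number on that distinction: a minimal sketch,
assuming the two LSNs are read on the standby with
`SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();`. The helper
names below are hypothetical and not part of PostgreSQL; the server-side
function `pg_wal_lsn_diff()` gives the same figure directly.

```shell
# An LSN prints as two hexadecimal numbers, "high 32 bits/low 32 bits".
# Convert one such string (for example "16/B374D848") to a byte position.
lsn_to_bytes() {
    echo $(( (0x${1%%/*} << 32) + 0x${1##*/} ))
}

# Bytes of WAL that have been received but not yet replayed.
# $1 = receive LSN, $2 = replay LSN
replay_gap_bytes() {
    echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

A gap that keeps growing while the receive LSN still advances would match
the "replay, not streaming" picture described above.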

My SysEng team is so far unable to spot what's causing the issue.

Suggestions re where to look next?

Thx!


postgres=# select version();
                                                                    version
-----------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.11 (Ubuntu 11.11-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit

$ uname -a
Linux foobox.foocorp.com 4.15.0-139-generic #143~16.04.1-Ubuntu SMP Wed Mar 17 08:10:33 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

-- 
Jerry Sievers
Postgres DBA/Development Consulting



Re: Odd OS, SAN or related throughput issue affecting large streamer

From
Johannes Truschnigg
Date:
Hi Jerry.

On Tue, Mar 23, 2021 at 12:25:35PM -0500, Jerry Sievers wrote:
> Greetings!
>
> [...]
> Suggestions re where to look next?

Have you looked at the usual OS-level metrics yet?

Does the process that applies the streamed WAL to the secondary's database
exhibit high CPU load when the slowdown starts becoming noticeable? Check by
using `top` and `mpstat`.

Does I/O latency spike/increase during that time? Maybe your system is
beginning to page data in and out? Check by using `iostat` and `vmstat`.
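
One possible way to watch for this over several hours is a small filter
on `vmstat` output. This is only a sketch: the function name and the
20% iowait threshold are hypothetical, and the column positions assume
the default procps `vmstat` layout.

```shell
# Read `vmstat <interval>` output on stdin and print only the samples that
# show swap traffic (si or so above zero) or iowait above a limit.
# Assumed column order:
#   r b swpd free buff cache si so bi bo in cs us sy id wa st
flag_pressure() {
    wa_limit=${1:-20}   # hypothetical threshold: percent of CPU time in iowait
    awk -v limit="$wa_limit" '
        # Skip the two header lines; data rows start with a number.
        $1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0 || $16 > limit) {
            printf "si=%s so=%s wa=%s\n", $7, $8, $16
        }'
}
```

Usage would be along the lines of `vmstat 5 | flag_pressure 20`, left
running from a fresh reboot until the slowdown sets in.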

If you can identify a/the metric that changes when your troubles begin to
manifest, we should be able to help you drill down to the root cause of it all
much better.

--
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   https://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-email or attachments. Thank you.
