On Fri, Jun 5, 2020 at 7:07 AM Krzysztof Olszewski <kolszew73@gmail.com> wrote:
I have a problem with one of my Postgres production servers. The server works fine almost always, but sometimes, without any increase in the number of transactions or statements, the machine gets stuck. Cores go up to 100%, load up to 160%. When this happens, there are problems connecting to the database, and even when a connection succeeds, simple queries take several seconds instead of milliseconds. The problem sometimes stops after a period of time (e.g. 35 min); sometimes we must restart Postgres, Linux, or even KVM (which is the virtualization host).
My hardware: 56 cores (Intel Core Processor (Skylake, IBRS)), 400 GB RAM, RAID10 with about 40k IOPS.
Set to 1 (same as default seq_page_cost) for a moment and try it.
In the normal state I have about 500 tps, 5% core usage, and about 3% load. The whole database fits in memory, so there are no reads from disk, only writes at about 500 IOPS, sometimes spiking to 1500 IOPS, but this hardware has no problem with these values (no iowait on the cores). In the normal state this machine does "nothing". Connections to the database are created by two Java-based app servers through connection pools, so the connection count is limited by the pool configuration; the maximum is 120, which is lower than the limit in the Postgres configuration (150). In the normal state there are about 20 connections; when the machine is stuck, the count goes to the maximum (120).
Correlated with the stalls, I see messages in the kernel log like NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s! [postmaster:33935], but I don't know whether this is the cause or an effect of the problem. I investigated with pgBadger and ... nothing strange happens, just normal statements.
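To line the soft lockup messages up against the times the database stalls, it may help to pull them out of the kernel log in a structured form. The following is only a minimal sketch in Python: the regular expression is modeled on the single message quoted above, so other kernels may format the line differently, and the function name `parse_soft_lockups` is made up for illustration.

```python
import re

# Pattern modeled on the one kernel message quoted in this thread
# (assumption: other soft lockup lines follow the same layout).
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft lockup message found in the given log lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # not a soft lockup line
        events.append(
            {
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "process": match.group("proc"),
                "pid": int(match.group("pid")),
            }
        )
    return events


if __name__ == "__main__":
    sample = [
        "NMI watchdog: BUG: soft lockup - CPU#25 stuck for 23s! [postmaster:33935]",
        "some unrelated kernel message",
    ]
    for event in parse_soft_lockups(sample):
        print(event)
```

Feeding it the lines of the kernel log (with their timestamps kept alongside) would show which CPUs and which backend PIDs are reported during each stall. That could then be compared with what those PIDs were running at the time.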