Thread: Problems with PostgreSQL on Google compute engine

From: Josef Machytka
Date:

Hi guys,

I am not sure whether this problem is really related to PostgreSQL, but maybe someone here has an idea.

We run several Debian instances with PostgreSQL on Google Compute Engine, and lately we have seen several occurrences of the following problem.

An instance suddenly becomes unresponsive. We cannot SSH into it and we cannot connect to the database. Our internal monitoring with Telegraf also stops during that period, so no monitoring data is collected.

Google's monitoring of CPU activity shows very low usage during that period. The GCP logs do not show any migration; in fact, they do not show anything at all. All internal logs, for instance the PostgreSQL log, syslog, and logs from periodic cron jobs, show the same gap. It looks as if the instance was frozen during that time. So far we have noticed it only on PostgreSQL instances, since these are heavily used.
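To make "the same gap" concrete, here is a minimal sketch of how the silent window could be located in a log file. It assumes each line starts with a timestamp in the form `YYYY-MM-DD HH:MM:SS` (as with a `log_line_prefix` beginning with `%m` or `%t`); the function names, the 5-minute threshold, and the sample lines in the usage note are illustrative only, not taken from our systems.

```python
from datetime import datetime, timedelta


def parse_log_lines(lines, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse the leading 19-character timestamp of each line.

    Assumption: lines begin with 'YYYY-MM-DD HH:MM:SS'. Lines that do
    not match (continuation lines, stack traces) are skipped.
    """
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line[:19], fmt))
        except ValueError:
            continue
    return stamps


def find_gaps(timestamps, threshold=timedelta(minutes=5)):
    """Return (start, end, duration) for every pair of consecutive
    timestamps that are further apart than `threshold`."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur, cur - prev))
    return gaps
```

Usage would look like `find_gaps(parse_log_lines(open(path)))`, where `path` is whichever log is being checked. Running it over the PostgreSQL log, syslog, and cron logs separately and comparing the reported windows would show whether they really all went silent at the same moment.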

The instances run the following OS and PG versions:
Debian 9 with PG 11.9
Debian 9 with PG 10.13

These incidents usually last 10-15 minutes, but in one case it was 1 hour 20 minutes. At the end of the incident some PG process is killed by the OOM killer, yet activity on the database immediately before the incident is usually relatively low, and so are CPU and memory usage. So it looks more as if the instance has limited resources when it starts running again, if that is even possible.

Any idea what could cause these issues, or what we should look for? As I mentioned, the internal logs on Debian generally contain nothing for the period of the incident.

Thanks.

Josef Machytka