Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug - Mailing list pgsql-general

From hubert depesz lubaczewski
Subject Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
Date
Msg-id aKSyq74zIuLS-2B5@depesz.com
Whole thread Raw
Responses Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
List pgsql-general
Hi,
we have following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using
arm64 architecture.

All works, is fine.

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

I'm talkign about "prepare" for preparing statements. binds. also
"discard all".

We have logging enabled to csv log, which offers milisecond precicision.
For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time.

And they all have insane durations - 800-1300ms for virtually anything.

After such second, everything works fine, without any problems.

Up to next case.

Unfortunately due to short duration of such things, and the fact that
literally *everything* is paused for this 1 second, it's hard to
debug/diagnose.

Servers have memory that is almost 2x total db size (200gb vs. 384gb of
ram), so disk shouldn't be an issue.

Aside from this, we don't see any other problems.

Any idea how to look at it, what to look for, to be able to diagnose the
issue?

We do use some savepoints, but it's hard to tell when exactly they
happen (we usually log only queries that take more than 250ms, and
queries that use savepoints are usually much faster).

The DB server in question has ~ 150 connections, and handles, at this
time, we had ~ 40-50 ktps.

Logging is set using:

               name                │            setting
═══════════════════════════════════╪════════════════════════════════
 log_autovacuum_min_duration       │ 0
 log_checkpoints                   │ on
 log_connections                   │ on
 log_destination                   │ csvlog
 log_directory                     │ /cache/postgres_logs
 log_disconnections                │ off
 log_duration                      │ off
 log_error_verbosity               │ default
 log_executor_stats                │ off
 log_file_mode                     │ 0600
 log_filename                      │ postgresql-%Y-%m-%d_%H%M%S.log
 log_hostname                      │ off
 log_line_prefix                   │ db=%d,user=%u
 log_lock_waits                    │ on
 log_min_duration_sample           │ 0
 log_min_duration_statement        │ 0
 log_min_error_statement           │ error
 log_min_messages                  │ warning
 log_parameter_max_length          │ -1
 log_parameter_max_length_on_error │ 0
 log_parser_stats                  │ off
 log_planner_stats                 │ off
 log_recovery_conflict_waits       │ off
 log_replication_commands          │ off
 log_rotation_age                  │ 60
 log_rotation_size                 │ 51200
 log_statement                     │ none
 log_statement_sample_rate         │ 0.0001
 log_statement_stats               │ off
 log_temp_files                    │ 0
 log_timezone                      │ UTC
 log_transaction_sample_rate       │ 0
 log_truncate_on_rotation          │ off
 logging_collector                 │ on

and it doesn't seem to show anything interesting. Checkpoints happen,
but they don't seem correlated in any way.

Any ideas?

Best regards,

depesz




pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Why analyze reports 30000 pages and rows scanned. Why not just rows?
Next
From: Scot Kreienkamp
Date:
Subject: vacuum analyze query performance - help me understand