Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug - Mailing list pgsql-general
From | hubert depesz lubaczewski |
---|---|
Subject | Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug |
Date | |
Msg-id | aKSyq74zIuLS-2B5@depesz.com Whole thread Raw |
Responses |
Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
|
List | pgsql-general |
Hi, we have following situation: Pg 14.17 (yes, I know, but it can't be upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using arm64 architecture. All works, is fine. Every now and then (usually every 3-5 minutes, but not through the whole day), we see situations where every query suddently takes ~ 1 second. I'm talkign about "prepare" for preparing statements. binds. also "discard all". We have logging enabled to csv log, which offers milisecond precicision. For ~ 1 second there are no logs going to log (we usually have at 5-20 messages logged per second), no connection, nothing. And then we get bunch (30+) messages with the same milisecond time. And they all have insane durations - 800-1300ms for virtually anything. After such second, everything works fine, without any problems. Up to next case. Unfortunately due to short duration of such things, and the fact that literally *everything* is paused for this 1 second, it's hard to debug/diagnose. Servers have memory that is almost 2x total db size (200gb vs. 384gb of ram), so disk shouldn't be an issue. Aside from this, we don't see any other problems. Any idea how to look at it, what to look for, to be able to diagnose the issue? We do use some savepoints, but it's hard to tell when exactly they happen (we usually log only queries that take more than 250ms, and queries that use savepoints are usually much faster). The DB server in question has ~ 150 connections, and handles, at this time, we had ~ 40-50 ktps. Logging is set using: name │ setting ═══════════════════════════════════╪════════════════════════════════ log_autovacuum_min_duration │ 0 log_checkpoints │ on log_connections │ on log_destination │ csvlog log_directory │ /cache/postgres_logs log_disconnections │ off log_duration │ off log_error_verbosity │ default log_executor_stats │ off log_file_mode │ 0600 log_filename │ postgresql-%Y-%m-%d_%H%M%S.log log_hostname │ off log_line_prefix │ db=%d,user=%u log_lock_waits │ on log_min_duration_sample │ 0 log_min_duration_statement │ 0 log_min_error_statement │ error log_min_messages │ warning log_parameter_max_length │ -1 log_parameter_max_length_on_error │ 0 log_parser_stats │ off log_planner_stats │ off log_recovery_conflict_waits │ off log_replication_commands │ off log_rotation_age │ 60 log_rotation_size │ 51200 log_statement │ none log_statement_sample_rate │ 0.0001 log_statement_stats │ off log_temp_files │ 0 log_timezone │ UTC log_transaction_sample_rate │ 0 log_truncate_on_rotation │ off logging_collector │ on and it doesn't seem to show anything interesting. Checkpoints happen, but they don't seem correlated in any way. Any ideas? Best regards, depesz
pgsql-general by date: