Thread: High System CPU Usage on Selects Seemingly Caused By Vacuum of Same Table


From: Joshua Banton
Date:
Hello,

We are experiencing a high system CPU usage issue with our database that we can't quite explain.

Description of the issue:
The database instance CPU suddenly increases to over 90%, and the majority of it, as reported by RDS Enhanced Monitoring, is system CPU rather than CPU used by Postgres itself. IOPS usage barely changes; there is plenty of headroom.

The issue mostly manifests near the end of the "scanning heap" phase of vacuuming one of our largest tables, which we'll call table1. RDS Performance Insights reports that selects on table1 start to wait on CPU, where previously they didn't even show up in the top 25 queries by wait. It doesn't always happen, but it is more likely when there is a larger-than-usual number of selects on table1.


Example Timeline:
00:00 vacuum starts on table1 (at least 10 million dead tuples as reported by pg_stat_user_tables)
02:00 database CPU p90 12%, RDS Performance Insights CPU waits 11%, IOPS under 100k, table1 not in top 25 waits in RDS Performance Insights, vacuum reports 6GB out of 3TB remaining to scan (see the progress query sketch below)
02:01 database CPU p90 98%, RDS Performance Insights CPU waits 73%, IOPS under 100k, table1 selects now #1, #2, and #6 in RDS Performance Insights, making up the majority of the CPU waits
02:42 vacuum on table1 finishes, database CPU p90 11%, RDS Performance Insights CPU waits 12%, IOPS under 100k, table1 not in top 25 waits in RDS Performance Insights
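
For reference, the scan-progress figures above can be read from pg_stat_progress_vacuum; a query along these lines (assuming the default 8kB block size) shows how much of the heap is left to scan:

  -- Heap-scan progress of any running vacuum; 8192 is the default block size.
  SELECT p.pid,
         c.relname,
         p.phase,
         pg_size_pretty((p.heap_blks_total - p.heap_blks_scanned) * 8192) AS remaining_to_scan,
         pg_size_pretty(p.heap_blks_total * 8192) AS total_to_scan
  FROM pg_stat_progress_vacuum p
  JOIN pg_class c ON c.oid = p.relid;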


Installation Details:
  AWS RDS Managed Instance
  Postgres Version: PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-17), 64-bit
  Instance Class: r7i.48xlarge
  vCPU: 192
  RAM: 1536 GB
  Storage Type: Provisioned IOPS SSD (io2), 180000 IOPS
  Application Language/Framework: Ruby on Rails
  Pooler: PgBouncer 1.17.0
 
Config:
|name                                       |current_setting                        
|-------------------------------------------|---------------------------------------
|autovacuum_analyze_scale_factor            |0.005                                  
|autovacuum_freeze_max_age                  |500000000                              
|autovacuum_max_workers                     |24                                    
|autovacuum_naptime                         |15s                                    
|autovacuum_vacuum_cost_delay               |1ms                                    
|autovacuum_vacuum_cost_limit               |2000                                  
|autovacuum_vacuum_scale_factor             |0.01                                  
|autovacuum_work_mem                        |48756174kB                            
|checkpoint_completion_target               |0.9                                    
|checkpoint_timeout                         |1h                                    
|default_toast_compression                  |lz4                                    
|effective_cache_size                       |780098792kB                            
|enable_presorted_aggregate                 |on                                    
|fsync                                      |on                                    
|full_page_writes                           |on                                    
|gss_accept_delegation                      |off                                    
|hot_standby                                |off                                    
|huge_pages                                 |on                                    
|icu_validation_level                       |error                                  
|idle_in_transaction_session_timeout        |30s                                    
|ignore_invalid_pages                       |off                                    
|jit                                        |off                                    
|lo_compat_privileges                       |off                                    
|log_autovacuum_min_duration                |0                                      
|log_checkpoints                            |on                                    
|log_connections                            |on                                    
|log_disconnections                         |on                                    
|log_min_duration_statement                 |10ms                                  
|log_rotation_age                           |1h                                    
|log_temp_files                             |0                                      
|log_timezone                               |UTC                                    
|log_truncate_on_rotation                   |off                                    
|logging_collector                          |on                                    
|maintenance_work_mem                       |25577009kB                            
|max_connections                            |5000                                  
|max_locks_per_transaction                  |64                                    
|max_logical_replication_workers            |10                                    
|max_parallel_apply_workers_per_subscription|2                                      
|max_parallel_workers                       |96                                    
|max_prepared_transactions                  |0                                      
|max_replication_slots                      |10                                    
|max_slot_wal_keep_size                     |400000MB                              
|max_stack_depth                            |6MB                                    
|max_standby_archive_delay                  |-1                                    
|max_standby_streaming_delay                |2147483647ms                          
|max_wal_senders                            |25                                    
|max_wal_size                               |200000MB                              
|max_worker_processes                       |20                                    
|min_wal_size                               |128MB                                  
|random_page_cost                           |1.05                                  
|recovery_init_sync_method                  |syncfs                                
|recovery_prefetch                          |off                                    
|remove_temp_files_after_crash              |off                                    
|reserved_connections                       |2                                      
|scram_iterations                           |4096                                  
|send_abort_for_crash                       |off                                    
|send_abort_for_kill                        |off                                    
|session_preload_libraries                  |                                      
|shared_buffers                             |390049392kB                            
|shared_preload_libraries                   |rdsutils,pg_tle,pg_stat_statements    
|synchronous_commit                         |on                                    
|track_activity_query_size                  |4kB                                    
|track_functions                            |pl                                    
|track_io_timing                            |on                                    
|track_wal_io_timing                        |on                                    
|vacuum_buffer_usage_limit                  |399410kB                              
|vacuum_cost_page_miss                      |5                                      
|vacuum_failsafe_age                        |1200000000                            
|vacuum_multixact_failsafe_age              |1200000000                            
|wal_compression                            |lz4                                    
|wal_keep_size                              |400000MB                              
|wal_level                                  |logical                                
|wal_receiver_create_temp_slot              |off                                    
|wal_receiver_timeout                       |15min                                  
|wal_sender_timeout                         |2min                                  
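
For reference, a listing in this shape can be pulled from pg_settings with something like the following (here filtered to anything not at its built-in default; the exact filter is a matter of taste):

  -- Sketch: settings that differ from their built-in defaults.
  SELECT name, current_setting(name) AS current_setting
  FROM pg_settings
  WHERE source <> 'default'
  ORDER BY name;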

Large Table Stats:
table1:
    rows: 2,303,386,400
    disk space: 3.5T
    rel size: 3T
    FSM size: 758 MB
    visibility map size: 94 MB
   
table2:
    rows: 1,919,360,500
    disk space: 1.6T
    rel size: 758G
    FSM size: 189 MB
    visibility map size: 24 MB
   
table3:
    rows: 2,412,903,000
    disk space: 741G
    rel size: 387G
    FSM size: 96 MB
    visibility map size: 12 MB
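
These per-table numbers can be reproduced with a query roughly like the following (reltuples is the planner's estimate; the FSM and visibility map are separate relation forks):

  -- Approximate row counts plus total, main-fork, FSM, and visibility-map sizes.
  SELECT c.relname,
         c.reltuples::bigint                             AS approx_rows,
         pg_size_pretty(pg_total_relation_size(c.oid))   AS disk_space,
         pg_size_pretty(pg_relation_size(c.oid, 'main')) AS rel_size,
         pg_size_pretty(pg_relation_size(c.oid, 'fsm'))  AS fsm_size,
         pg_size_pretty(pg_relation_size(c.oid, 'vm'))   AS vm_size
  FROM pg_class c
  WHERE c.relname IN ('table1', 'table2', 'table3');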


Any suggestions or insights greatly appreciated!

Joshua Banton
On Fri, Jan 31, 2025 at 5:28 PM Joshua Banton <bantonj@gmail.com> wrote:
> The issue mostly manifests near the end of the "scanning heap" phase of vacuuming one of our largest tables, which
> we'll call table1. RDS Performance Insights reports that selects on table1 start to wait on CPU, where previously
> they didn't even show up in the top 25 queries by wait. It doesn't always happen, but it is more likely when there
> is a larger-than-usual number of selects on table1.

Does this database also have many tables? As in thousands of tables?

I am reminded of this issue:

https://www.postgresql.org/message-id/flat/da3205c4-5b07-a65c-6c26-a293c6464fdb%40postgrespro.ru

I've heard of this happening when an aggressive VACUUM updates
relfrozenxid on a larger table.
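
If you want to check whether those vacuums are the aggressive kind, comparing table1's relfrozenxid age against the freeze thresholds is a quick start; something along these lines:

  -- How close is table1 to vacuum_freeze_table_age / autovacuum_freeze_max_age?
  SELECT relname,
         age(relfrozenxid) AS xid_age,
         current_setting('vacuum_freeze_table_age')::int   AS freeze_table_age,
         current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
  FROM pg_class
  WHERE relname = 'table1';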

--
Peter Geoghegan



There are not that many tables: 175.

There are a decent number of transactions on the database, 35k a second during the day. I ran a VACUUM VERBOSE on table1 today and got this output about the pages frozen:

frozen: 2813436 pages from table (0.70% of total) had 7028123 tuples frozen
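
For completeness, this was roughly the command, and the dead-tuple figure in my original timeline comes from pg_stat_user_tables:

  -- Roughly what produced the "frozen:" line above:
  VACUUM (VERBOSE) table1;

  -- Dead-tuple counts as reported by pg_stat_user_tables:
  SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
  WHERE relname = 'table1';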



On Fri, Jan 31, 2025 at 5:46 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jan 31, 2025 at 5:28 PM Joshua Banton <bantonj@gmail.com> wrote:
> > The issue mostly manifests near the end of the "scanning heap" phase of vacuuming one of our largest tables, which we'll call table1. RDS Performance Insights reports that selects on table1 start to wait on CPU, where previously they didn't even show up in the top 25 queries by wait. It doesn't always happen, but it is more likely when there is a larger-than-usual number of selects on table1.
>
> Does this database also have many tables? As in thousands of tables?
>
> I am reminded of this issue:
>
> https://www.postgresql.org/message-id/flat/da3205c4-5b07-a65c-6c26-a293c6464fdb%40postgrespro.ru
>
> I've heard of this happening when an aggressive VACUUM updates
> relfrozenxid on a larger table.
>
> --
> Peter Geoghegan