WAL accumulating, Logical Replication pg 13 - Mailing list pgsql-general

From      | Willy-Bas Loos
Subject   | WAL accumulating, Logical Replication pg 13
Msg-id    | CAHnozTiua7QqvFjKcu4LFBpwtt_DmaTeyiUzrCnn6xsLhkxbSQ@mail.gmail.com
Responses | Re: WAL accumulating, Logical Replication pg 13
List      | pgsql-general
Hi,

I'm upgrading a 1.5 TB database from postgres 9.3 to postgres 13 on Debian 10. This is now in an Acceptance stage (DTAP). I have encountered a problem: the WAL is not being deleted. I now have 1.4 TB of WAL in pg_wal and my disks are getting full. The oldest WAL file is 18 days old.
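(For reference, on pg 13 the pg_wal usage can also be checked from inside the database with pg_ls_waldir(); a query along these lines:)

db=# SELECT count(*) AS wal_files,
            pg_size_pretty(sum(size)) AS total_size,
            min(modification) AS oldest_file
     FROM pg_ls_waldir();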
I use Logical Replication from the new cluster to another new cluster with 1 subscriber and 1 subscription.
pg_stat_subscription tells me all recent timestamps.
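(That is, on the subscriber, a check roughly like this; column names per the pg 13 docs:)

db=# SELECT subname, received_lsn, latest_end_lsn,
            last_msg_receipt_time, latest_end_time
     FROM pg_stat_subscription;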
And on the publisher, pg_replication_slots shows this:
db=# select * from pg_replication_slots;
-[ RECORD 1 ]-------+-------------
slot_name | my_pub1
plugin | pgoutput
slot_type | logical
datoid | 16401
database | db
temporary | f
active | t
active_pid | 9480
xmin |
catalog_xmin | 269168
restart_lsn | D4/908BC268
confirmed_flush_lsn | E1/25BF5710
wal_status | extended
safe_wal_size |
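To quantify how much WAL the slot is holding back, a query along these lines should work (pg_current_wal_lsn() and pg_wal_lsn_diff() are standard in pg 13):

db=# SELECT slot_name,
            pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
     FROM pg_replication_slots;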
I've had problems with disk space on this server before, with postgres crashing because of it; I then added more disk space and postgres recovered. That doesn't seem to be the problem now.
The publication has the options publish = 'insert, update, delete, truncate', publish_via_partition_root = false
The subscription has the options connect = true, enabled = true, create_slot = false, slot_name = my_pub1, synchronous_commit = 'off'
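(Both can be verified in the catalogs; roughly, on the publisher and subscriber respectively:)

db=# SELECT pubname, pubinsert, pubupdate, pubdelete, pubtruncate, pubviaroot
     FROM pg_publication;
db=# SELECT subname, subenabled, subslotname, subsynccommit
     FROM pg_subscription;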
The log on the publisher says:
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: starting logical decoding for slot "my_pub1"
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: Streaming transactions committing after D6/A82B5FE0, reading WAL from D4/908BC268.
2021-05-25 21:25:18.973 CEST [4584] user@db LOG: logical decoding found consistent point at D4/908BC268
2021-05-25 21:25:18.973 CEST [4584] user@db DETAIL: There are no running transactions.
2021-05-25 21:29:49.456 CEST [4614] user@db ERROR: replication slot "my_pub1" is active for PID 4584
2021-05-25 21:29:54.474 CEST [4615] user@db ERROR: replication slot "my_pub1" is active for PID 4584
And on the subscriber:
2021-05-28 21:23:46.702 CEST [40039] LOG: logical replication apply worker for subscription "my_pub1" has started
2021-05-28 21:23:46.712 CEST [40039] ERROR: could not start WAL streaming: ERROR: replication slot "my_pub1" is active for PID 730
2021-05-28 21:23:46.714 CEST [19794] LOG: background worker "logical replication worker" (PID 40039) exited with exit code 1
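If the old walsender holding the slot is genuinely stuck, I assume I could inspect it and, if necessary, terminate it on the publisher so that a fresh connection can attach; something along these lines (the PID being the one named in the error):

db=# SELECT pid, state, sent_lsn, reply_time FROM pg_stat_replication;
db=# SELECT pg_terminate_backend(4584); -- the PID currently holding the slot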
The postgres settings on the publisher are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 50GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 1GB
min_wal_size = 80MB
#archive_mode = off
max_wal_senders = 10 # max number of walsender processes
wal_sender_timeout = 1min # in milliseconds; 0 disables
max_replication_slots = 7 # max number of replication slots
The postgres settings on the subscriber are:
max_connections = 100 # (change requires restart)
tcp_keepalives_idle = 120 # TCP_KEEPIDLE, in seconds;
shared_buffers = 25GB # min 128kB
work_mem = 1GB # min 64kB
maintenance_work_mem = 10GB # min 1MB
logical_decoding_work_mem = 5GB # min 64kB
dynamic_shared_memory_type = posix # the default is the first option
max_worker_processes = 20 # (change requires restart)
max_parallel_maintenance_workers = 10 # taken from max_parallel_workers
max_parallel_workers_per_gather = 5 # taken from max_parallel_workers
max_parallel_workers = 15 # maximum number of max_worker_processes that
wal_level = logical # minimal, replica, or logical
max_wal_size = 3GB
min_wal_size = 80MB
#archive_mode = off
wal_receiver_timeout = 1min # time that receiver waits for
max_logical_replication_workers = 10 # taken from max_worker_processes
max_sync_workers_per_subscription = 5 # taken from max_logical_replication_workers
I've tried increasing wal_sender_timeout and wal_receiver_timeout to 10 minutes each, but this had no positive effect.
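(I changed them roughly like this on the publisher, with the corresponding wal_receiver_timeout change on the subscriber:)

db=# ALTER SYSTEM SET wal_sender_timeout = '10min';
db=# SELECT pg_reload_conf();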
Some advice would be helpful.
--
Willy-Bas Loos