Thread: BUG #18666: Standby cluster does not start after building from Primary cluster.

BUG #18666: Standby cluster does not start after building from Primary cluster.

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18666
Logged by:          Андрій Лозовський
Email address:      andrew.lozovsky@gmail.com
PostgreSQL version: 16.4
Operating system:   Ubuntu 20.04.5 LTS
Description:

Standby cluster does not start after building from Primary cluster. 

The copy2slave16.sh script was used for the building Standby cluster:
#!/bin/bash
. /home/postgres/pg.sh
LD_LIBRARY_PATH=$PGLIB $PGBIN/psql -Upgadmin -d postgres -f
/home/postgres/copy2slave16.sql > /home/postgres/copy.log 2>
/home/postgres/copy.err

copy2slave16.sql:
SELECT pg_backup_start(label => 'label', fast => true);
\! rsync -a /opt/pgdata-16/ <Standby_cluster_IP>:/opt/pgdata-16/ --exclude
postmaster.pid  --exclude pg_wal/  --exclude pg_replslot/ >
/home/postgres/rsync.log 2> /home/postgres/rsync.err
SELECT * FROM pg_backup_stop(wait_for_archive => true);

There was also an attempt to building the Standby cluster using
pg_basebackup.
Problem was a same.

The following messages are indicated in the log file:

2024-10-21 10:38:32.672 EEST [2816535] [] [] [] []LOG:  starting PostgreSQL
16.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1)
9.4.0, 64-bit
2024-10-21 10:38:32.672 EEST [2816535] [] [] [] []LOG:  listening on IPv4
address "0.0.0.0", port 5432
2024-10-21 10:38:32.672 EEST [2816535] [] [] [] []LOG:  listening on IPv6
address "::", port 5432
2024-10-21 10:38:32.677 EEST [2816535] [] [] [] []LOG:  listening on Unix
socket "/tmp/.s.PGSQL.5432"
2024-10-21 10:38:32.685 EEST [2816540] [] [] [] []LOG:  database system was
interrupted; last known up at 2024-10-21 06:58:37 EEST
cp: cannot stat '/opt/archivelog/00000002.history': No such file or
directory
2024-10-21 10:38:32.725 EEST [2816540] [] [] [] []LOG:  entering standby
mode
2024-10-21 10:38:32.725 EEST [2816540] [] [] [] []LOG:  invalid checkpoint
record
2024-10-21 10:38:32.725 EEST [2816540] [] [] [] []PANIC:  could not locate a
valid checkpoint record
2024-10-21 10:38:32.851 EEST [2816535] [] [] [] []LOG:  startup process (PID
2816540) was terminated by signal 6: Aborted
2024-10-21 10:38:32.851 EEST [2816535] [] [] [] []LOG:  aborting startup due
to startup process failure
2024-10-21 10:38:32.851 EEST [2816535] [] [] [] []LOG:  [pg_stat_monitor]
pgsm_shmem_shutdown: Shutdown initiated.
2024-10-21 10:38:32.872 EEST [2816535] [] [] [] []LOG:  database system is
shut down

The pg_hba.conf file contains all the settings for the primary and standby
clusters.
The standby.signal file is created in the data directory of the standby
cluster.

# CUSTOMIZED OPTIONS in postgresql.conf on Standby cluster
port = 5432
listen_addresses = '*'
superuser_reserved_connections = 3
log_timezone = 'Europe/Kiev'
timezone = 'Europe/Kiev'
logging_collector = on
log_directory = '/opt/pglogs'
#log_filename = 'p2p_unk-%a.log'
#log_filename = 'p2p_unp-%a.log'
log_filename = 'p2p_una-%a.log'
log_truncate_on_rotation = on
log_line_prefix = '%m [%p] [%a] [%u] [%d] [%h]'
log_statement = 'none'              # none, ddl, mod, all
log_connections = on
autovacuum = on
shared_preload_libraries = 'pg_stat_statements,pg_stat_monitor'
synchronous_commit = on
max_locks_per_transaction = 256
cluster_name = 'p2p_un'

max_connections = 200
shared_buffers = 16GB
effective_cache_size = 32GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.2
effective_io_concurrency = 200
work_mem = 16MB
huge_pages = try
min_wal_size = 2GB
max_wal_size = 8GB
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
log_min_duration_statement = 1s#min_time_query_in_log

archive_mode = on
wal_level = replica                     #logical
max_replication_slots = 14              #кількість підписок
max_wal_senders = 15                    #max_replication_slots+1
max_logical_replication_workers = 15    #max_replication_slots + резерв
max_worker_processes = 16               #max_logical_replication_workers +
1
wal_keep_size = 32

#archive_command = 'rsync -aq %p
postgres@<Standby_cluster_IP>:/opt/archivelog/%f'#AWS
#archive_command = 'rsync -aq %p
postgres@<Primary_cluster_IP>:/opt/archivelog/%f'#AWS
archive_command = 'cp %p /dev/null'

hot_standby = on

primary_conninfo = 'user=pgadmin host=<Primary_cluster_IP> port=5432
user=pgadmin sslmode=prefer sslcompression=1' # Amazon
#primary_conninfo = 'user=pgadmin host=<Standby_cluster_IP> port=5432
user=pgadmin sslmode=prefer sslcompression=1' # Amazon
recovery_target_timeline = 'latest'
restore_command = 'cp /opt/archivelog/%f %p'
archive_cleanup_command = '/usr/pgsql-16/bin/pg_archivecleanup
/opt/archivelog %r'
#primary_slot_name = 'standby_slot_unk'
#primary_slot_name = 'standby_slot_unp'
primary_slot_name = 'standby_slot_una'

log_lock_waits = 'on'
deadlock_timeout = '100ms'

# Parameters of the Server claster.
OS - Ubuntu 20.04.5 LTS (Focal Fossa)
CPU - 16
RAM - 64 Gb
DB Size - 6.7T


PG Bug reporting form <noreply@postgresql.org> writes:
> Standby cluster does not start after building from Primary cluster. 

I think your archiving setup is broken:

> archive_command = 'cp %p /dev/null'
> ...
> restore_command = 'cp /opt/archivelog/%f %p'

BTW, "cp" is widely regarded as not being safe enough for production
archiving, because it doesn't fsync.  But this archive_command
isn't even trying.  So it's not surprising if there's not enough WAL
to recover from.

            regards, tom lane



copy2slave16.sql:
SELECT pg_backup_start(label => 'label', fast => true);
\! rsync -a /opt/pgdata-16/ <Standby_cluster_IP>:/opt/pgdata-16/ --exclude
postmaster.pid  --exclude pg_wal/  --exclude pg_replslot/ >
/home/postgres/rsync.log 2> /home/postgres/rsync.err
SELECT * FROM pg_backup_stop(wait_for_archive => true);

How is the archive_command parameter configured on your primary? I see primary and standby nodes aren't deployed on the same host, so is it set up to send WAL files to the /opt/archivelog directory on the standby machine? 

Regards,
Pixian Shi

PG Bug reporting form <noreply@postgresql.org> 于2024年10月21日周一 19:33写道:
The following bug has been logged on the website:

Bug reference:      18666
Logged by:          Андрій Лозовський
Email address:      andrew.lozovsky@gmail.com
PostgreSQL version: 16.4
Operating system:   Ubuntu 20.04.5 LTS
Description:       

Standby cluster does not start after building from Primary cluster.

The copy2slave16.sh script was used for the building Standby cluster:
#!/bin/bash
. /home/postgres/pg.sh
LD_LIBRARY_PATH=$PGLIB $PGBIN/psql -Upgadmin -d postgres -f
/home/postgres/copy2slave16.sql > /home/postgres/copy.log 2>
/home/postgres/copy.err

copy2slave16.sql:
SELECT pg_backup_start(label => 'label', fast => true);
\! rsync -a /opt/pgdata-16/ <Standby_cluster_IP>:/opt/pgdata-16/ --exclude
postmaster.pid  --exclude pg_wal/  --exclude pg_replslot/ >
/home/postgres/rsync.log 2> /home/postgres/rsync.err
SELECT * FROM pg_backup_stop(wait_for_archive => true);

There was also an attempt to building the Standby cluster using
pg_basebackup.
Problem was a same.

The following messages are indicated in the log file:

2024-10-21 10:38:32.672 EEST [2816535] [] [] [] []LOG:  starting PostgreSQL
16.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1)
9.4.0, 64-bit
2024-10-21 10:38:32.672 EEST [2816535] [] [] [] []LOG:  listening on IPv4
address "0.0.0.0", port 5432
2024-10-21 10:38:32.672 EEST [2816535] [] [] [] []LOG:  listening on IPv6
address "::", port 5432
2024-10-21 10:38:32.677 EEST [2816535] [] [] [] []LOG:  listening on Unix
socket "/tmp/.s.PGSQL.5432"
2024-10-21 10:38:32.685 EEST [2816540] [] [] [] []LOG:  database system was
interrupted; last known up at 2024-10-21 06:58:37 EEST
cp: cannot stat '/opt/archivelog/00000002.history': No such file or
directory
2024-10-21 10:38:32.725 EEST [2816540] [] [] [] []LOG:  entering standby
mode
2024-10-21 10:38:32.725 EEST [2816540] [] [] [] []LOG:  invalid checkpoint
record
2024-10-21 10:38:32.725 EEST [2816540] [] [] [] []PANIC:  could not locate a
valid checkpoint record
2024-10-21 10:38:32.851 EEST [2816535] [] [] [] []LOG:  startup process (PID
2816540) was terminated by signal 6: Aborted
2024-10-21 10:38:32.851 EEST [2816535] [] [] [] []LOG:  aborting startup due
to startup process failure
2024-10-21 10:38:32.851 EEST [2816535] [] [] [] []LOG:  [pg_stat_monitor]
pgsm_shmem_shutdown: Shutdown initiated.
2024-10-21 10:38:32.872 EEST [2816535] [] [] [] []LOG:  database system is
shut down

The pg_hba.conf file contains all the settings for the primary and standby
clusters.
The standby.signal file is created in the data directory of the standby
cluster.

# CUSTOMIZED OPTIONS in postgresql.conf on Standby cluster
port = 5432
listen_addresses = '*'
superuser_reserved_connections = 3
log_timezone = 'Europe/Kiev'
timezone = 'Europe/Kiev'
logging_collector = on
log_directory = '/opt/pglogs'
#log_filename = 'p2p_unk-%a.log'
#log_filename = 'p2p_unp-%a.log'
log_filename = 'p2p_una-%a.log'
log_truncate_on_rotation = on
log_line_prefix = '%m [%p] [%a] [%u] [%d] [%h]'
log_statement = 'none'              # none, ddl, mod, all
log_connections = on
autovacuum = on
shared_preload_libraries = 'pg_stat_statements,pg_stat_monitor'
synchronous_commit = on
max_locks_per_transaction = 256
cluster_name = 'p2p_un'

max_connections = 200
shared_buffers = 16GB
effective_cache_size = 32GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1.2
effective_io_concurrency = 200
work_mem = 16MB
huge_pages = try
min_wal_size = 2GB
max_wal_size = 8GB
max_parallel_workers_per_gather = 4
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
log_min_duration_statement = 1s#min_time_query_in_log

archive_mode = on
wal_level = replica                     #logical
max_replication_slots = 14              #кількість підписок
max_wal_senders = 15                    #max_replication_slots+1
max_logical_replication_workers = 15    #max_replication_slots + резерв
max_worker_processes = 16               #max_logical_replication_workers +
1
wal_keep_size = 32

#archive_command = 'rsync -aq %p
postgres@<Standby_cluster_IP>:/opt/archivelog/%f'#AWS
#archive_command = 'rsync -aq %p
postgres@<Primary_cluster_IP>:/opt/archivelog/%f'#AWS
archive_command = 'cp %p /dev/null'

hot_standby = on

primary_conninfo = 'user=pgadmin host=<Primary_cluster_IP> port=5432
user=pgadmin sslmode=prefer sslcompression=1' # Amazon
#primary_conninfo = 'user=pgadmin host=<Standby_cluster_IP> port=5432
user=pgadmin sslmode=prefer sslcompression=1' # Amazon
recovery_target_timeline = 'latest'
restore_command = 'cp /opt/archivelog/%f %p'
archive_cleanup_command = '/usr/pgsql-16/bin/pg_archivecleanup
/opt/archivelog %r'
#primary_slot_name = 'standby_slot_unk'
#primary_slot_name = 'standby_slot_unp'
primary_slot_name = 'standby_slot_una'

log_lock_waits = 'on'
deadlock_timeout = '100ms'

# Parameters of the Server claster.
OS - Ubuntu 20.04.5 LTS (Focal Fossa)
CPU - 16
RAM - 64 Gb
DB Size - 6.7T

On Monday, October 21, 2024, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      18666
Logged by:          Андрій Лозовський
Email address:      andrew.lozovsky@gmail.com
PostgreSQL version: 16.4
Operating system:   Ubuntu 20.04.5 LTS
Description:       


There isn’t likely to be a bug here.
 

There was also an attempt to building the Standby cluster using
pg_basebackup.
Problem was a same.

I suggest to request help on using pg_basebackup on the -general mailing list.

David J.