Streaming replication: replicant server not starting (9.4.4, Win2008) - Mailing list pgsql-general
From | Tim Bowden |
---|---|
Subject | Streaming replication: replicant server not starting (9.4.4, Win2008) |
Date | |
Msg-id | 1517336187.18851.57.camel@mapforge.com.au Whole thread Raw |
Responses |
Re: Streaming replication: replicant server not starting (9.4.4, Win 2008)
|
List | pgsql-general |
I've inherited a PG 9.4.4 install on Win 2008 that I'm wanting to stream from (abt 80Gb on disk). Everything seems to be working from the "master" side. I have an AMI of the master host (a vmware instance in our soon to close data centre) that has been spun up in AWS (replicant). In it's "default" configuration (ie, exactly the same as master) it worked fine. I ran pg_basebackup on the master (from memory- details on work pc): pg_basebackup -D "e:\\mybasebackup\\" -F t -R -X f -z -c fast \ -h <ip_addr> -U <rep_user> Base backup is created fine. I've copied it up to AWS and extracted it into the correct location (after ensuring it is empty) on the replicant windows host (tar -xzvf base.tgz from a linux box with the pg data drive mounted). File metadata is changed during the extraction (ie, file timestamps- could this possibly be an issue?) but permissions on the win2008 replicant host have been set & double checked. When I try to start the pg server on replicant, the log streaming from master works (and keeps working till the server is rebooted), but the server can't start for some reason. Copy of pg log file on replicant: 2018-01-30 17:18:02 AWST LOG: database system was shut down in recovery at 2018-01-30 17:16:59 AWST 2018-01-30 17:18:02 AWST LOG: entering standby mode 2018-01-30 17:18:02 AWST LOG: redo starts at C26/84018A30 2018-01-30 17:18:02 AWST LOG: consistent recovery state reached at C26/8401C6D0 2018-01-30 17:18:02 AWST LOG: invalid record length at C26/8401C6D0 2018-01-30 17:18:03 AWST LOG: started streaming WAL from primary at C26/84000000 on timeline 1 2018-01-30 17:18:03 AWST FATAL: the database system is starting up 2018-01-30 17:18:04 AWST FATAL: the database system is starting up 2018-01-30 17:18:05 AWST FATAL: the database system is starting up 2018-01-30 17:18:06 AWST FATAL: the database system is starting up 2018-01-30 17:18:07 AWST FATAL: the database system is starting up 2018-01-30 17:18:08 AWST FATAL: the database system is starting up The last line is repeated for about a minute till it gives up. It leaves behind a bunch of processes that keep streaming log files as master creates new wal records. This keeps going till the replicant box is rebooted (easiest way to clean up the processes and free up port 5432). I can keep rebooting the replicant host and on retrying to start the pg server, it will apply any new wal files that have been streamed, then fail to start the server but keep streaming again. The Windows service start command for PG: "C:\Program Files (x86)\PostgreSQL\9.4.4\bin\pg_ctl.exe" runservice -N "postgresql-9.4" -D "E:\PostgreSQL\9.4.4\data" -w Config files: Master postgresql.conf: dynamic_shared_memory_type = windows wal_level = hot_standby # Was "archive", but we'd like to do hot... archive_mode = on archive_command = 'copy "%p" "e:\\PostgreSQL\\9.4.4\\wal_archive\\%f"' # Windows max_wal_senders = 8 wal_keep_segments = 900 # Sick of losing old segments while I fix this max_replication_slots = 8 log_line_prefix = '%t ' log_timezone = 'Australia/Perth' datestyle = 'iso, dmy' timezone = 'Australia/Perth' lc_messages = 'English_Australia.1252' lc_monetary = 'English_Australia.1252' lc_numeric = 'English_Australia.1252' lc_time = 'English_Australia.1252' default_text_search_config = 'pg_catalog.english' listen_addresses = '*' port = 5432 max_connections = 300 shared_buffers = 500MB work_mem = 32MB maintenance_work_mem = 128MB checkpoint_segments = 100 random_page_cost = 2.0 effective_cache_size = 1500MB log_destination = 'stderr' logging_collector = on log_min_duration_statement = 500 log_line_prefix = '%t ' Replicant postgresql.conf (essentially a copy from master with minimal changes): dynamic_shared_memory_type = windows wal_level = archive max_standby_streaming_delay = 30s wal_receiver_status_interval = 30s log_line_prefix = '%t ' log_timezone = 'Australia/Perth' datestyle = 'iso, dmy' timezone = 'Australia/Perth' lc_messages = 'English_Australia.1252' lc_monetary = 'English_Australia.1252' lc_numeric = 'English_Australia.1252' lc_time = 'English_Australia.1252' default_text_search_config = 'pg_catalog.english' listen_addresses = '*' port = 5432 max_connections = 300 shared_buffers = 500MB work_mem = 32MB maintenance_work_mem = 128MB checkpoint_segments = 100 random_page_cost = 2.0 effective_cache_size = 1500MB log_destination = 'stderr' logging_collector = on log_min_duration_statement = 500 log_line_prefix = '%t ' replicant recovery.conf: standby_mode = 'on' primary_conninfo = 'host=<master_ip> user=<rep_user> password=<don't_tell> connect_timeout=10 application_name=replicant_3' primary_slot_name = 'replicant_3' recovery_min_apply_delay = 10 #restore_command = 'copy "e:\\PostgreSQL\\9.4.4\\basebackup\\wal_archive\\%f" "%p" ' trigger_file = 'recovery_done.txt' The restore_command was uncommented at first and worked fine. All new updates now are provided by streaming. Any ideas? I'm tearing my hair out with this. The boxes are about as identical as you can get, apart from the fact that one is running on VMware and the other in AWS. Thanks, Tim Bowden
pgsql-general by date: