psql: FATAL: the database system is starting up - Mailing list pgsql-general

From: Tom K
Subject: psql: FATAL: the database system is starting up
Msg-id: CAE3EmBDX+TRLot5Ar6fJwV_QazSrbQ8JmWb9ZhGtj9wgtu7c_g@mail.gmail.com
List: pgsql-general

Hey guys,


I'm running Patroni w/ PostgreSQL 10, etcd, HAProxy and Keepalived on 3 RHEL 7.6 VMs.  Every now and then the underlying storage crashes, taking out the cluster.  On recovery, PostgreSQL has always come back up while other databases just blow up.  That is, until now.

After the most recent crash 2-3 weeks ago, the cluster keeps running into the error in the subject line, and I'm not able to make heads or tails of why it's throwing it.  Here's what I see when starting postgres manually:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
^C-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
^C-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".

The database logs only print out this:

2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
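
That "specified neither primary_conninfo nor restore_command" warning seems to just mean recovery.conf gives the standby no way to actually fetch WAL, so all it can do is poll pg_wal.  For reference, a recovery.conf that can pull from a primary would look roughly like this (Patroni normally writes this file itself; the connection values below are only placeholders, not my real ones):

# recovery.conf (illustrative only -- Patroni manages the real one)
standby_mode = 'on'
primary_conninfo = 'host=<primary-ip> port=5432 user=replicator password=...'
recovery_target_timeline = 'latest'
# and/or, if a WAL archive exists:
# restore_command = 'cp /path/to/wal_archive/%f %p'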

When connecting, I get this cryptic message:

[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL:  the database system is starting up
[root@psql01 log]#
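
For what it's worth, pg_controldata should show what state the cluster thinks it's in; something like this (just a sketch, assuming the data directory is /data/patroni as above):

# what does the control file say about the cluster state?
/usr/pgsql-10/bin/pg_controldata /data/patroni | grep -E "cluster state|Minimum recovery|REDO location"

I'd expect the "Database cluster state" line to read something like "in archive recovery" while it's stuck like this.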

There's no disk space issue here.  The cluster just stays in recovery; that state never changes and the DB is inaccessible.

So I ran the postgres command above in debug mode (-d 5) and got this:


DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory

So I freed memory on the underlying physical host this VM sits on, but no luck: same memory error and same result as above.
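
From what I can tell, that MAP_HUGETLB debug message on its own should be harmless: with huge_pages left at its default of 'try', postgres just falls back to normal pages when the huge-page allocation fails.  To rule it out completely, something along these lines (a sketch, assuming postgresql.conf lives under /data/patroni):

# check whether the host has any huge pages configured at all
grep -i HugePages /proc/meminfo

# huge_pages defaults to 'try'; setting it to 'off' in postgresql.conf
# (or via the Patroni config) takes huge pages out of the picture entirely:
#   huge_pages = off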

When Patroni is started on just this one node (for troubleshooting purposes), this is what I see:

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

The system prints "postgres: postgres: startup process waiting for 000000010000000000000008"

Seems it's trying to replay the timeline history but can't?  Running strace on the startup process reveals that it's looking for the following:

read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1)              = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC)            = 3

But this file doesn't exist in the directory on this particular cluster node (psql01).  If I copy the missing history files over from another cluster node, it does move forward, but it just gets stuck on the next history file, and psql01 still ends up with the same error message as above.
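
In case it's useful, this is roughly how the nodes can be compared (a sketch; I'm assuming the same /data/patroni data directory on each node):

# which timeline does the control file think each node is on?
/usr/pgsql-10/bin/pg_controldata /data/patroni | grep -i timeline

# and which timeline history files actually exist in pg_wal?
ls -l /data/patroni/pg_wal/*.history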

So I give up on that and try to recover the secondary node (psql02) instead, but that yields this error:

PANIC:  replication checkpoint has wrong magic 0 instead of 307747550

and I'm not able to find very much to move me forward on this one either.  
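
From the little I have found, that message appears to come from the replication origin checkpoint file that lives under pg_logical/ in the data directory, and a magic of 0 would fit a file whose contents got zeroed in the storage crash.  A first check would just be to look at it (a sketch; path assumes the same data directory):

ls -l /data/patroni/pg_logical/replorigin_checkpoint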

So again I give up after a few hours of trying to recover psql02 and move on to the third cluster node (pgsql03), where I get this error, which I'm also not able to find much info on:

FATAL:  syntax error in history file: f2W
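
My understanding is that a timeline history file is plain text, one line per timeline switch (timeline ID, switch LSN, reason), so "f2W" looks like garbage bytes rather than anything postgres would have written.  Eyeballing them is straightforward (a sketch, same data-directory assumption):

for f in /data/patroni/pg_wal/*.history; do echo "== $f"; cat "$f"; done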

Hoping I'm able to recover the data from at least one of the nodes, make it primary, and get it to replicate to the others.
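
Once one node is healthy again, the rough plan with patronictl would be something like this (a sketch; "postgres" is the cluster/scope name matching --cluster_name above, and the member name is a placeholder for whatever "list" reports):

patronictl -c /etc/patroni.yml list
# after the surviving node is running as leader, rebuild each broken member from it:
patronictl -c /etc/patroni.yml reinit postgres psql02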

Thx, TK
