Re: psql: FATAL: the database system is starting up - Mailing list pgsql-general

From Adrian Klaver
Subject Re: psql: FATAL: the database system is starting up
Date
Msg-id 5b7e0d74-8d6e-c3b8-fb62-3aab1fe0145d@aklaver.com
Whole thread Raw
In response to psql: FATAL: the database system is starting up  (Tom K <tomkcpr@gmail.com>)
Responses Re: psql: FATAL: the database system is starting up
List pgsql-general
On 5/27/19 9:59 PM, Tom K wrote:
> Hey Guy's,
> 
> 
> I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3 
> RHEL 7.6 VM's.  Every now and then the underlying storage crashes taking 
> out the cluster.  On recovery, PostgreSQL tends to come up while other 
> databases just blow up.  That is until now.

The above is one master and two standby servers connected via streaming 
replication, correct?

Is there WAL archiving set up?


> 
> After the most recent crash 2-3 weeks ago, the cluster is now running 
> into this message but I'm not able to make heads or tails out of why 
> it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is 
that  more recent?

When you refer to history files below are you talking about WAL files or 
something else?

Is this:

"recovery command file "recovery.conf" specified neither 
primary_conninfo nor restore_command"

true?



> 
> |-bash-4.2$/usr/pgsql-10/bin/postgres -D /data/patroni 
> --config-file=/data/patroni/postgresql.conf2019-05-2308:26:34.561EDT 
> [10101]LOG:listening onIPv4 address "192.168.0.108",port 
> 54322019-05-2308:26:34.565EDT [10101]LOG:listening onUnix socket 
> "./.s.PGSQL.5432"2019-05-2308:26:34.591EDT [10101]LOG:redirecting log 
> output tologging collector process 2019-05-2308:26:34.591EDT 
> [10101]HINT:Future log output will appear indirectory 
> "log".^C-bash-4.2/usr/pgsql-10/bin/postgres -D /data/patroni 
> --config-file=/data/patroni/postgresql.conf --hot_standby=on 
> --listen_addresses=192.168.0.108 --max_worker_processes=8 
> --max_locks_per_transaction=64 --wal_level=replica 
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 
> --max_replication_slots=10 
> --max_connections=1002019-05-2308:26:45.707EDT [10105]LOG:listening 
> onIPv4 address "192.168.0.108",port 54322019-05-2308:26:45.715EDT 
> [10105]LOG:listening onUnix socket 
> "./.s.PGSQL.5432"2019-05-2308:26:45.740EDT [10105]LOG:redirecting log 
> output tologging collector process 2019-05-2308:26:45.740EDT 
> [10105]HINT:Future log output will appear indirectory 
> "log".^C-bash-4.2$/usr/pgsql-10/bin/postgres -D /data/patroni 
> --config-file=/data/patroni/postgresql.conf --hot_standby=on 
> --listen_addresses=192.168.0.1 --max_worker_processes=8 
> --max_locks_per_transaction=64 --wal_level=replica 
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 
> --max_replication_slots=10 
> --max_connections=1002019-05-2308:29:36.014EDT [10188]LOG:listening 
> onIPv4 address "192.168.0.108",port 54322019-05-2308:29:36.017EDT 
> [10188]LOG:listening onUnix socket 
> "./.s.PGSQL.5432"2019-05-2308:29:36.040EDT [10188]LOG:redirecting log 
> output tologging collector process 2019-05-2308:29:36.040EDT 
> [10188]HINT:Future log output will appear indirectory "log".|
> 
> The database logs only print out this:
> 
> |2019-05-2308:25:44.476EDT [10088]HINT:The databaseserver will regularly 
> poll the pg_wal subdirectory tocheckforfiles placed 
> there.2019-05-2308:25:44.477EDT [10088]LOG:entering standby mode 
> 2019-05-2308:25:50.209EDT [10086]LOG:received fast shutdownrequest 
> 2019-05-2308:25:50.215EDT [10089]LOG:shutting down 
> 2019-05-2308:25:50.221EDT [10086]LOG:databasesystem isshut down 
> 2019-05-2308:26:11.560EDT [10094]LOG:databasesystem was shut down 
> inrecovery at 2019-05-2308:25:50EDT 2019-05-2308:26:11.560EDT 
> [10094]WARNING:recovery command file"recovery.conf"specified neither 
> primary_conninfo nor restore_command 2019-05-2308:26:11.560EDT 
> [10094]HINT:The databaseserver will regularly poll the pg_wal 
> subdirectory tocheckforfiles placed there.2019-05-2308:26:11.562EDT 
> [10094]LOG:entering standby mode 2019-05-2308:26:17.817EDT 
> [10092]LOG:received fast shutdownrequest 2019-05-2308:26:17.824EDT 
> [10095]LOG:shutting down 2019-05-2308:26:17.831EDT 
> [10092]LOG:databasesystem isshut down 2019-05-2308:26:34.596EDT 
> [10103]LOG:databasesystem was shut down inrecovery at 
> 2019-05-2308:26:17EDT 2019-05-2308:26:34.596EDT [10103]WARNING:recovery 
> command file"recovery.conf"specified neither primary_conninfo nor 
> restore_command 2019-05-2308:26:34.596EDT [10103]HINT:The databaseserver 
> will regularly poll the pg_wal subdirectory tocheckforfiles placed 
> there.2019-05-2308:26:34.598EDT [10103]LOG:entering standby mode 
> 2019-05-2308:26:43.025EDT [10101]LOG:received fast shutdownrequest 
> 2019-05-2308:26:43.032EDT [10104]LOG:shutting down 
> 2019-05-2308:26:43.038EDT [10101]LOG:databasesystem isshut down 
> 2019-05-2308:26:45.748EDT [10107]LOG:databasesystem was shut down 
> inrecovery at 2019-05-2308:26:43EDT 2019-05-2308:26:45.748EDT 
> [10107]WARNING:recovery command file"recovery.conf"specified neither 
> primary_conninfo nor restore_command 2019-05-2308:26:45.748EDT 
> [10107]HINT:The databaseserver will regularly poll the pg_wal 
> subdirectory tocheckforfiles placed there.|
> 
> and when connecting, I get this cryptic message:
> 
> |[root@psql01 log]#psql -U postgres -h 192.168.0.108psql:FATAL:the 
> databasesystem isstarting up [root@psql01 log]#|
> 
> There's no space issue here.  The cluster stays in recovery. This state 
> doesn't change and the DB is inaccessible.
> 
> So I ran the above in debug mode ( -d 5 ) and got this:
> 
> 
> |DEBUG:mmap(148897792)withMAP_HUGETLB failed,huge pages disabled:Cannot 
> allocate memory|
> 
> so I freed memory on the underlying physical this VM sits on but no 
> luck. Same memory error and same result as above.
> 
> When patroni is started on just this one node (troubleshooting purposes) 
> I get the message:
> 
> |[root@psql01 ~]#ps -ef|grep -Ei "patroni|postgres"root 
> 22172188000:38pts/100:00:00tail -f postgresql-Thu.log postgres 
> 25121400:42?00:00:01/usr/bin/python2 /bin/patroni /etc/patroni.yml 
> postgres 25331000:42?00:00:00/usr/pgsql-10/bin/postgres -D /data/patroni 
> --config-file=/data/patroni/postgresql.conf --hot_standby=on 
> --listen_addresses=192.168.0.108 --max_worker_processes=8 
> --max_locks_per_transaction=64 --wal_level=replica 
> --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 
> --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 
> --max_replication_slots=10 --max_connections=100postgres 
> 25352533000:42?00:00:00postgres:postgres:logger process postgres 
> 25362533000:42?00:00:00postgres:postgres:startup process waiting 
> for000000010000000000000008root 26642039000:42pts/000:00:00grep 
> --color=auto -Ei patroni|postgres[root@psql01 ~]#|
> 
> The system prints "postgres: postgres: startup process waiting for 
> 000000010000000000000008"
> 
> Seems it's trying to replay the history log but can't? Running an strace 
> on the file reveals that it's looking for the following:
> 
> |read(5,0x7fff9cb4eb87,1)=-1EAGAIN (Resource temporarily 
> unavailable)read(5,0x7fff9cb4eb87,1)=-1EAGAIN (Resource temporarily 
> unavailable)open("pg_wal/00000098.history",O_RDONLY)=-1ENOENT (No such 
> fileordirectory)epoll_create1(EPOLL_CLOEXEC)=3|
> 
> But this file doesn't exist in the directory on this particular cluster 
> node (psql01).  I copy the missing history files off another cluster 
> node to this one and it does move forward but just get's stuck on 
> another history file.  But same error message as above on psql01.
> 
> So I give up and try to recover the secondary node (psql02).  But this 
> yields this error:
> 
> |PANIC:replicationcheckpointhas wrong magic 0instead of307747550|
> 
> and I'm not able to find very much to move me forward on this one either.
> 
> So again I give up after a few hours of trying to recover psql02 and 
> move on to the third node.  On the third cluster node (pgsql03) I get 
> this error which I'm also not able to find much info on:
> 
> |FATAL:syntax error inhistory file:f2W|
> 
> Hoping I'm able to recover the data from at least one of the nodes, make 
> it primary, and get it to replicate to the others.
> 
> Thx, TK
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com



pgsql-general by date:

Previous
From: Tom K
Date:
Subject: psql: FATAL: the database system is starting up
Next
From: Adrian Klaver
Date:
Subject: Re: Alternate methods for multiple rows input/output to a function.