Re: psql: FATAL: the database system is starting up - Mailing list pgsql-general
From | Adrian Klaver |
---|---|
Subject | Re: psql: FATAL: the database system is starting up |
Date | |
Msg-id | 5b7e0d74-8d6e-c3b8-fb62-3aab1fe0145d@aklaver.com Whole thread Raw |
In response to | psql: FATAL: the database system is starting up (Tom K <tomkcpr@gmail.com>) |
Responses |
Re: psql: FATAL: the database system is starting up
|
List | pgsql-general |
On 5/27/19 9:59 PM, Tom K wrote: > Hey Guy's, > > > I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3 > RHEL 7.6 VM's. Every now and then the underlying storage crashes taking > out the cluster. On recovery, PostgreSQL tends to come up while other > databases just blow up. That is until now. The above is one master and two standby servers connected via streaming replication, correct? Is there WAL archiving set up? > > After the most recent crash 2-3 weeks ago, the cluster is now running > into this message but I'm not able to make heads or tails out of why > it's throwing this: So you have not been able to run the cluster the past 2-3 weeks or is that more recent? When you refer to history files below are you talking about WAL files or something else? Is this: "recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command" true? > > |-bash-4.2$/usr/pgsql-10/bin/postgres -D /data/patroni > --config-file=/data/patroni/postgresql.conf2019-05-2308:26:34.561EDT > [10101]LOG:listening onIPv4 address "192.168.0.108",port > 54322019-05-2308:26:34.565EDT [10101]LOG:listening onUnix socket > "./.s.PGSQL.5432"2019-05-2308:26:34.591EDT [10101]LOG:redirecting log > output tologging collector process 2019-05-2308:26:34.591EDT > [10101]HINT:Future log output will appear indirectory > "log".^C-bash-4.2/usr/pgsql-10/bin/postgres -D /data/patroni > --config-file=/data/patroni/postgresql.conf --hot_standby=on > --listen_addresses=192.168.0.108 --max_worker_processes=8 > --max_locks_per_transaction=64 --wal_level=replica > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 > --max_replication_slots=10 > --max_connections=1002019-05-2308:26:45.707EDT [10105]LOG:listening > onIPv4 address "192.168.0.108",port 54322019-05-2308:26:45.715EDT > [10105]LOG:listening onUnix socket > "./.s.PGSQL.5432"2019-05-2308:26:45.740EDT [10105]LOG:redirecting log > output tologging collector process 2019-05-2308:26:45.740EDT > [10105]HINT:Future log output will appear indirectory > "log".^C-bash-4.2$/usr/pgsql-10/bin/postgres -D /data/patroni > --config-file=/data/patroni/postgresql.conf --hot_standby=on > --listen_addresses=192.168.0.1 --max_worker_processes=8 > --max_locks_per_transaction=64 --wal_level=replica > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 > --max_replication_slots=10 > --max_connections=1002019-05-2308:29:36.014EDT [10188]LOG:listening > onIPv4 address "192.168.0.108",port 54322019-05-2308:29:36.017EDT > [10188]LOG:listening onUnix socket > "./.s.PGSQL.5432"2019-05-2308:29:36.040EDT [10188]LOG:redirecting log > output tologging collector process 2019-05-2308:29:36.040EDT > [10188]HINT:Future log output will appear indirectory "log".| > > The database logs only print out this: > > |2019-05-2308:25:44.476EDT [10088]HINT:The databaseserver will regularly > poll the pg_wal subdirectory tocheckforfiles placed > there.2019-05-2308:25:44.477EDT [10088]LOG:entering standby mode > 2019-05-2308:25:50.209EDT [10086]LOG:received fast shutdownrequest > 2019-05-2308:25:50.215EDT [10089]LOG:shutting down > 2019-05-2308:25:50.221EDT [10086]LOG:databasesystem isshut down > 2019-05-2308:26:11.560EDT [10094]LOG:databasesystem was shut down > inrecovery at 2019-05-2308:25:50EDT 2019-05-2308:26:11.560EDT > [10094]WARNING:recovery command file"recovery.conf"specified neither > primary_conninfo nor restore_command 2019-05-2308:26:11.560EDT > [10094]HINT:The databaseserver will regularly poll the pg_wal > subdirectory tocheckforfiles placed there.2019-05-2308:26:11.562EDT > [10094]LOG:entering standby mode 2019-05-2308:26:17.817EDT > [10092]LOG:received fast shutdownrequest 2019-05-2308:26:17.824EDT > [10095]LOG:shutting down 2019-05-2308:26:17.831EDT > [10092]LOG:databasesystem isshut down 2019-05-2308:26:34.596EDT > [10103]LOG:databasesystem was shut down inrecovery at > 2019-05-2308:26:17EDT 2019-05-2308:26:34.596EDT [10103]WARNING:recovery > command file"recovery.conf"specified neither primary_conninfo nor > restore_command 2019-05-2308:26:34.596EDT [10103]HINT:The databaseserver > will regularly poll the pg_wal subdirectory tocheckforfiles placed > there.2019-05-2308:26:34.598EDT [10103]LOG:entering standby mode > 2019-05-2308:26:43.025EDT [10101]LOG:received fast shutdownrequest > 2019-05-2308:26:43.032EDT [10104]LOG:shutting down > 2019-05-2308:26:43.038EDT [10101]LOG:databasesystem isshut down > 2019-05-2308:26:45.748EDT [10107]LOG:databasesystem was shut down > inrecovery at 2019-05-2308:26:43EDT 2019-05-2308:26:45.748EDT > [10107]WARNING:recovery command file"recovery.conf"specified neither > primary_conninfo nor restore_command 2019-05-2308:26:45.748EDT > [10107]HINT:The databaseserver will regularly poll the pg_wal > subdirectory tocheckforfiles placed there.| > > and when connecting, I get this cryptic message: > > |[root@psql01 log]#psql -U postgres -h 192.168.0.108psql:FATAL:the > databasesystem isstarting up [root@psql01 log]#| > > There's no space issue here. The cluster stays in recovery. This state > doesn't change and the DB is inaccessible. > > So I ran the above in debug mode ( -d 5 ) and got this: > > > |DEBUG:mmap(148897792)withMAP_HUGETLB failed,huge pages disabled:Cannot > allocate memory| > > so I freed memory on the underlying physical this VM sits on but no > luck. Same memory error and same result as above. > > When patroni is started on just this one node (troubleshooting purposes) > I get the message: > > |[root@psql01 ~]#ps -ef|grep -Ei "patroni|postgres"root > 22172188000:38pts/100:00:00tail -f postgresql-Thu.log postgres > 25121400:42?00:00:01/usr/bin/python2 /bin/patroni /etc/patroni.yml > postgres 25331000:42?00:00:00/usr/pgsql-10/bin/postgres -D /data/patroni > --config-file=/data/patroni/postgresql.conf --hot_standby=on > --listen_addresses=192.168.0.108 --max_worker_processes=8 > --max_locks_per_transaction=64 --wal_level=replica > --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 > --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 > --max_replication_slots=10 --max_connections=100postgres > 25352533000:42?00:00:00postgres:postgres:logger process postgres > 25362533000:42?00:00:00postgres:postgres:startup process waiting > for000000010000000000000008root 26642039000:42pts/000:00:00grep > --color=auto -Ei patroni|postgres[root@psql01 ~]#| > > The system prints "postgres: postgres: startup process waiting for > 000000010000000000000008" > > Seems it's trying to replay the history log but can't? Running an strace > on the file reveals that it's looking for the following: > > |read(5,0x7fff9cb4eb87,1)=-1EAGAIN (Resource temporarily > unavailable)read(5,0x7fff9cb4eb87,1)=-1EAGAIN (Resource temporarily > unavailable)open("pg_wal/00000098.history",O_RDONLY)=-1ENOENT (No such > fileordirectory)epoll_create1(EPOLL_CLOEXEC)=3| > > But this file doesn't exist in the directory on this particular cluster > node (psql01). I copy the missing history files off another cluster > node to this one and it does move forward but just get's stuck on > another history file. But same error message as above on psql01. > > So I give up and try to recover the secondary node (psql02). But this > yields this error: > > |PANIC:replicationcheckpointhas wrong magic 0instead of307747550| > > and I'm not able to find very much to move me forward on this one either. > > So again I give up after a few hours of trying to recover psql02 and > move on to the third node. On the third cluster node (pgsql03) I get > this error which I'm also not able to find much info on: > > |FATAL:syntax error inhistory file:f2W| > > Hoping I'm able to recover the data from at least one of the nodes, make > it primary, and get it to replicate to the others. > > Thx, TK > -- Adrian Klaver adrian.klaver@aklaver.com
pgsql-general by date: