Re: psql: FATAL: the database system is starting up - Mailing list pgsql-general

From Tom K
Subject Re: psql: FATAL: the database system is starting up
Date
Msg-id CAE3EmBC5biD+xPEHfGWg0iuTojhWvJrREnSh-3AViQmOsOx9ZQ@mail.gmail.com
In response to Re: psql: FATAL: the database system is starting up  (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses Re: psql: FATAL: the database system is starting up
List pgsql-general


On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 5/27/19 9:59 PM, Tom K wrote:
> Hey Guys,
>
>
> I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
> RHEL 7.6 VMs.  Every now and then the underlying storage crashes, taking
> out the cluster.  On recovery, PostgreSQL has always come back up while
> other databases just blow up.  That is, until now.

The above is one master and two standby servers connected via streaming
replication, correct?

Is there WAL archiving set up?

Correct.  Master election occurs through Patroni.  WAL level is set to:

wal_level = 'replica'

So no archiving.  
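
For reference, my understanding is that turning archiving on would only take a couple of lines in postgresql.conf, something like this (the /archive path is a placeholder, not something we actually run):

archive_mode = on                                              # requires a restart
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f' # placeholder path

With neither that nor a restore_command in play, there's no archive for recovery to fall back on.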
 


>
> After the most recent crash 2-3 weeks ago, the cluster is now running
> into this message but I'm not able to make heads or tails out of why
> it's throwing this:

So you have not been able to run the cluster for the past 2-3 weeks, or is
that more recent?

Haven't been able to bring this PostgreSQL cluster up ( run the cluster ) since it failed 2-3 weeks ago.  I've tried quite a few combinations of options to recover it, with no luck.  We've had storage failures before, even with corrupted OS files, but this PostgreSQL cluster w/ Patroni always came back up without any recovery effort on my part.
 

When you refer to history files below, are you talking about WAL files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

True. recovery.conf is managed by Patroni.  The contents of this file are the same on all the cluster nodes, with the exception of primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#
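
For comparison, a recovery.conf that could actually fetch WAL from somewhere would need at least one of these two lines (host, user and path below are placeholders, not our real settings):

primary_conninfo = 'host=192.168.0.108 port=5432 user=replicator'  # placeholder connection
restore_command = 'cp /archive/%f %p'                              # placeholder archive path

Since Patroni wrote neither into these files, the WARNING in the logs below is at least consistent.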

I've made a copy of the postgres data directory to another location so that, while troubleshooting, I can always revert to the state the cluster was in when it first failed.
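
The copy was nothing fancy, something along these lines with everything stopped first (the destination path is just what I picked):

systemctl stop patroni
rsync -a /data/patroni/ /data/patroni.orig/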

Thx,
TK
 



>
> -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf
> 2019-05-23 08:26:34.561 EDT [10101] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> 2019-05-23 08:26:34.565 EDT [10101] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> 2019-05-23 08:26:34.591 EDT [10101] LOG:  redirecting log output to logging collector process
> 2019-05-23 08:26:34.591 EDT [10101] HINT:  Future log output will appear in directory "log".
> ^C
> -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> 2019-05-23 08:26:45.707 EDT [10105] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> 2019-05-23 08:26:45.715 EDT [10105] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> 2019-05-23 08:26:45.740 EDT [10105] LOG:  redirecting log output to logging collector process
> 2019-05-23 08:26:45.740 EDT [10105] HINT:  Future log output will appear in directory "log".
> ^C
> -bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.1 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> 2019-05-23 08:29:36.014 EDT [10188] LOG:  listening on IPv4 address "192.168.0.108", port 5432
> 2019-05-23 08:29:36.017 EDT [10188] LOG:  listening on Unix socket "./.s.PGSQL.5432"
> 2019-05-23 08:29:36.040 EDT [10188] LOG:  redirecting log output to logging collector process
> 2019-05-23 08:29:36.040 EDT [10188] HINT:  Future log output will appear in directory "log".
>
> The database logs only print out this:
>
> 2019-05-23 08:25:44.476 EDT [10088] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> 2019-05-23 08:25:44.477 EDT [10088] LOG:  entering standby mode
> 2019-05-23 08:25:50.209 EDT [10086] LOG:  received fast shutdown request
> 2019-05-23 08:25:50.215 EDT [10089] LOG:  shutting down
> 2019-05-23 08:25:50.221 EDT [10086] LOG:  database system is shut down
> 2019-05-23 08:26:11.560 EDT [10094] LOG:  database system was shut down in recovery at 2019-05-23 08:25:50 EDT
> 2019-05-23 08:26:11.560 EDT [10094] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> 2019-05-23 08:26:11.560 EDT [10094] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> 2019-05-23 08:26:11.562 EDT [10094] LOG:  entering standby mode
> 2019-05-23 08:26:17.817 EDT [10092] LOG:  received fast shutdown request
> 2019-05-23 08:26:17.824 EDT [10095] LOG:  shutting down
> 2019-05-23 08:26:17.831 EDT [10092] LOG:  database system is shut down
> 2019-05-23 08:26:34.596 EDT [10103] LOG:  database system was shut down in recovery at 2019-05-23 08:26:17 EDT
> 2019-05-23 08:26:34.596 EDT [10103] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> 2019-05-23 08:26:34.596 EDT [10103] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
> 2019-05-23 08:26:34.598 EDT [10103] LOG:  entering standby mode
> 2019-05-23 08:26:43.025 EDT [10101] LOG:  received fast shutdown request
> 2019-05-23 08:26:43.032 EDT [10104] LOG:  shutting down
> 2019-05-23 08:26:43.038 EDT [10101] LOG:  database system is shut down
> 2019-05-23 08:26:45.748 EDT [10107] LOG:  database system was shut down in recovery at 2019-05-23 08:26:43 EDT
> 2019-05-23 08:26:45.748 EDT [10107] WARNING:  recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
> 2019-05-23 08:26:45.748 EDT [10107] HINT:  The database server will regularly poll the pg_wal subdirectory to check for files placed there.
>
> and when connecting, I get this cryptic message:
>
> [root@psql01 log]# psql -U postgres -h 192.168.0.108
> psql: FATAL:  the database system is starting up
> [root@psql01 log]#
>
> There's no space issue here.  The cluster stays in recovery. This state
> doesn't change and the DB is inaccessible.
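>
> A quick way to double-check that from outside, assuming the standard
> PG10 install path, is pg_controldata -- it reports the cluster state:
>
> /usr/pgsql-10/bin/pg_controldata /data/patroni | grep 'cluster state'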
>
> So I ran the above in debug mode (-d 5) and got this:
>
>
> DEBUG:  mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
>
> so I freed memory on the underlying physical host this VM sits on, but no
> luck: same memory error and same result as above.
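>
> As far as I can tell the huge-pages message is noise: the default
> huge_pages = 'try' falls back automatically, and the attempt can be
> avoided entirely with this in postgresql.conf:
>
> huge_pages = off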
>
> When Patroni is started on just this one node (for troubleshooting
> purposes), I see this:
>
> [root@psql01 ~]# ps -ef | grep -Ei "patroni|postgres"
> root      2217  2188  0 00:38 pts/1    00:00:00 tail -f postgresql-Thu.log
> postgres  2512     1  4 00:42 ?        00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
> postgres  2533     1  0 00:42 ?        00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
> postgres  2535  2533  0 00:42 ?        00:00:00 postgres: postgres: logger process
> postgres  2536  2533  0 00:42 ?        00:00:00 postgres: postgres: startup process   waiting for 000000010000000000000008
> root      2664  2039  0 00:42 pts/0    00:00:00 grep --color=auto -Ei patroni|postgres
> [root@psql01 ~]#
>
> The system prints "postgres: postgres: startup process waiting for
> 000000010000000000000008"
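>
> A quick way to see whether that segment actually exists (data
> directory taken from the ps output above):
>
> ls -l /data/patroni/pg_wal/ | grep 000000010000000000000008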
>
> Seems it's trying to replay the history but can't? Running strace on the
> startup process reveals that it's looking for the following:
>
> read(5, 0x7fff9cb4eb87, 1)                  = -1 EAGAIN (Resource temporarily unavailable)
> read(5, 0x7fff9cb4eb87, 1)                  = -1 EAGAIN (Resource temporarily unavailable)
> open("pg_wal/00000098.history", O_RDONLY)   = -1 ENOENT (No such file or directory)
> epoll_create1(EPOLL_CLOEXEC)                = 3
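>
> If I'm reading the naming right, 00000098.history would be the history
> file for timeline 0x98 = 152. What timeline the control file thinks
> it's on can be checked with (standard PG10 path assumed):
>
> /usr/pgsql-10/bin/pg_controldata /data/patroni | grep -i timeline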
>
> But this file doesn't exist in the directory on this particular cluster
> node (psql01).  If I copy the missing history files over from another
> cluster node, it does move forward, but then just gets stuck on another
> history file, with the same error message as above on psql01.
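>
> The copy itself was just along the lines of:
>
> scp psql02:/data/patroni/pg_wal/00000098.history /data/patroni/pg_wal/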
>
> So I give up and try to recover the secondary node (psql02), but this
> yields the following error:
>
> PANIC:  replication checkpoint has wrong magic 0 instead of 307747550
>
> and I'm not able to find very much to move me forward on this one either.
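>
> From what I can find, 307747550 is the magic number PostgreSQL expects
> at the start of the replication-origin checkpoint file, so it looks
> like pg_logical/replorigin_checkpoint got zeroed in the crash. Dumping
> it at least shows what's actually there (od is plain coreutils):
>
> od -A x -t x1 /data/patroni/pg_logical/replorigin_checkpoint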
>
> So again I give up after a few hours of trying to recover psql02 and
> move on to the third node.  On the third cluster node (psql03) I get
> this error, which I'm also not able to find much info on:
>
> FATAL:  syntax error in history file: f2W
>
> Hoping I'm able to recover the data from at least one of the nodes, make
> it primary, and get it to replicate to the others.
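>
> If I do get one node up as primary, I'm assuming the other two can be
> rebuilt from it with something like this (cluster scope and member
> names below are my assumptions, from --cluster_name and the slot names):
>
> patronictl -c /etc/patroni.yml reinit postgres postgresql1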
>
> Thx, TK
>


--
Adrian Klaver
adrian.klaver@aklaver.com
