Thread: Re: [BUGS] Crash observed during the start of the Postgres process

Re: [BUGS] Crash observed during the start of the Postgres process

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:
Hello,
 
Did you get a chance to take a look into the issue?
 
Please consider it with high priority. We will be awaiting your inputs.
 
Regards,
Sandhya
 
_____________________________________________
From: K S, Sandhya (Nokia - IN/Bangalore)
Sent: Thursday, April 20, 2017 1:36 PM
To: pgsql-hackers@postgresql.org; 'tgl@sss.pgh.pa.us' <tgl@sss.pgh.pa.us>; 'pgsql-bugs@postgresql.org' <pgsql-bugs@postgresql.org>
Cc: Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>
Subject: Crash observed during the start of the Postgres process
 
 
Hi ,
 
In Postgres 9.3.14, we are encountering crash during the start up of Postgres process. Logs state that it is trying to reap a dead process.
Attached the postgres debug logs and bt of the crash.
 
 
<< File: postgresql_crash_debug.log >>  << File: bt of Postgres crash.txt >>
 
Please help in analysing and debugging the issue.
 
Regards,
Sandhya
 
 
 

Re: [BUGS] Crash observed during the start of the Postgres process

From
Merlin Moncure
Date:
On Tue, Apr 25, 2017 at 8:44 AM, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya.k_s@nokia.com> wrote:
> Hello,
>
> Did you get a chance to take a look into the issue?
>
> Please consider it with high priority. We will be awaiting your inputs.

This email is heavily cross posted, which is obnoxious.  Can you paste
the relevant log snippet?

merlin



Re: [BUGS] Crash observed during the start of the Postgres process

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:
Hi Merlin,

Below is the log captured when the crash was encountered. 

STATEMENT:  select count(1) from pg_ls_dir(current_setting('data_directory')) where pg_ls_dir = 'backup_label'
LOG:  00000: duration: 4.083 ms
LOCATION:  exec_simple_query, postgres.c:1145
DEBUG:  00000: shmem_exit(0): 7 callbacks to make
LOCATION:  shmem_exit, ipc.c:212
DEBUG:  00000: proc_exit(0): 3 callbacks to make
LOCATION:  proc_exit_prepare, ipc.c:184
DEBUG:  00000: exit(0)
LOCATION:  proc_exit, ipc.c:135
DEBUG:  00000: shmem_exit(-1): 0 callbacks to make
LOCATION:  shmem_exit, ipc.c:212
DEBUG:  00000: proc_exit(-1): 0 callbacks to make
LOCATION:  proc_exit_prepare, ipc.c:184
DEBUG:  00000: reaping dead processes
LOCATION:  reaper, postmaster.c:2669
DEBUG:  00000: server process (PID 11104) exited with exit code 0
LOCATION:  LogChildExit, postmaster.c:3385
DEBUG:  00000: reaping dead processes
LOCATION:  reaper, postmaster.c:2669
LOG:  00000: startup process (PID 10265) was terminated by signal 6: Aborted
LOCATION:  LogChildExit, postmaster.c:3407
LOG:  00000: terminating any other active server processes
LOCATION:  HandleChildCrash, postmaster.c:3134
DEBUG:  00000: sending SIGQUIT to process 10994
LOCATION:  HandleChildCrash, postmaster.c:3233
DEBUG:  00000: sending SIGQUIT to process 10269
LOCATION:  HandleChildCrash, postmaster.c:3263
DEBUG:  00000: sending SIGQUIT to process 10268
LOCATION:  HandleChildCrash, postmaster.c:3275
WARNING:  57P02: terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.
 
HINT:  In a moment you should be able to reconnect to the database and repeat your command.


Backtrace of the core generated:
(gdb) bt
#0  0x0000005563bcf9c0 in raise () from /lib64/libc.so.6
#1  0x0000005563bd42bc in abort () from /lib64/libc.so.6
#2  0x000000012039e228 in errfinish ()
#3  0x000000012039ef08 in elog_finish ()
#4  0x000000012009eb08 in btree_redo ()
#5  0x00000001200caff8 in StartupXLOG ()
#6  0x0000000120259958 in StartupProcessMain ()
#7  0x00000001200d590c in AuxiliaryProcessMain ()
#8  0x0000000120253434 in ?? ()

Let me know if any further clarification/information is needed.

Regards,
Sandhya

-----Original Message-----
From: Merlin Moncure [mailto:mmoncure@gmail.com] 
Sent: Tuesday, April 25, 2017 8:01 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: pgsql-hackers@postgresql.org; tgl@sss.pgh.pa.us; pgsql-bugs@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore)
<prakash.itnal@nokia.com>;T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>
 
Subject: Re: [BUGS] Crash observed during the start of the Postgres process

On Tue, Apr 25, 2017 at 8:44 AM, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya.k_s@nokia.com> wrote:
> Hello,
>
> Did you get a chance to take a look into the issue?
>
> Please consider it with high priority. We will be awaiting your inputs.

This email is heavily cross posted, which is obnoxious.  Can you paste
the relevant log snippet?

merlin

Re: [BUGS] Crash observed during the start of the Postgres process

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:

Hi Merlin,

 

I have filtered the logs based on PID (19825) to see if this helps to debug the issue further. Here are some observations:

 

  1. Though the restoration is complete (0000001C000000000000012F.done file created), we see some error logs related to restoration failure as below:

 

                1: [19825-58f473ff.4d71-1] 2017-04-17 07:51:27.919 GMT < > LOG:  00000: database system was interrupted while in recovery at log time 2017-04-16 20:09:38 GMT

                  2: [19825-58f473ff.4d71-2] 2017-04-17 07:51:27.919 GMT < > HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.

                  3: [19825-58f473ff.4d71-3] 2017-04-17 07:51:27.919 GMT < > LOCATION:  StartupXLOG, xlog.c:4988

                  4: [19825-58f473ff.4d71-4] 2017-04-17 07:51:28.111 GMT < > DEBUG:  00000: standby_mode = 'true'

                  5: [19825-58f473ff.4d71-5] 2017-04-17 07:51:28.111 GMT < > LOCATION:  readRecoveryCommandFile, xlog.c:4312

                  6: [19825-58f473ff.4d71-6] 2017-04-17 07:51:28.111 GMT < > DEBUG:  00000: primary_conninfo = 'host=CFPU-1 port=5433 user=_qnrnwdbman'

                  7: [19825-58f473ff.4d71-7] 2017-04-17 07:51:28.111 GMT < > LOCATION:  readRecoveryCommandFile, xlog.c:4319

                  8: [19825-58f473ff.4d71-8] 2017-04-17 07:51:28.112 GMT < > DEBUG:  00000: trigger_file = '/mnt/DB_name/db_data/trigger_file'

                  9: [19825-58f473ff.4d71-9] 2017-04-17 07:51:28.112 GMT < > LOCATION:  readRecoveryCommandFile, xlog.c:4326

                  10: [19825-58f473ff.4d71-10] 2017-04-17 07:51:28.112 GMT < > DEBUG:  00000: restore_command = 'exit 1'

                  11: [19825-58f473ff.4d71-11] 2017-04-17 07:51:28.112 GMT < > LOCATION:  readRecoveryCommandFile, xlog.c:4186

                  12: [19825-58f473ff.4d71-12] 2017-04-17 07:51:28.112 GMT < > LOG:  00000: entering standby mode

                  13: [19825-58f473ff.4d71-13] 2017-04-17 07:51:28.112 GMT < > LOCATION:  StartupXLOG, xlog.c:5057

                  14: [19825-58f473ff.4d71-14] 2017-04-17 07:51:28.112 GMT < > DEBUG:  00000: executing restore command "exit 1"

                  15: [19825-58f473ff.4d71-15] 2017-04-17 07:51:28.112 GMT < > LOCATION:  RestoreArchivedFile, xlogarchive.c:201

                  16: [19825-58f473ff.4d71-16] 2017-04-17 07:51:28.133 GMT < > DEBUG:  00000: could not restore file "0000001C.history" from archive: return code 256

                  17: [19825-58f473ff.4d71-17] 2017-04-17 07:51:28.133 GMT < > LOCATION:  RestoreArchivedFile, xlogarchive.c:304

                  18: [19825-58f473ff.4d71-18] 2017-04-17 07:51:28.134 GMT < > DEBUG:  00000: executing restore command "exit 1"

                  19: [19825-58f473ff.4d71-19] 2017-04-17 07:51:28.134 GMT < > LOCATION:  RestoreArchivedFile, xlogarchive.c:201

                  20: [19825-58f473ff.4d71-20] 2017-04-17 07:51:28.156 GMT < > DEBUG:  00000: could not restore file "0000001C000000000000012F" from archive: return code 256

                  21: [19825-58f473ff.4d71-21] 2017-04-17 07:51:28.156 GMT < > LOCATION:  RestoreArchivedFile, xlogarchive.c:304

                  22: [19825-58f473ff.4d71-22] 2017-04-17 07:51:28.156 GMT < > DEBUG:  00000: checkpoint record is at 0/4BDE6A88

                  23: [19825-58f473ff.4d71-23] 2017-04-17 07:51:28.156 GMT < > LOCATION:  StartupXLOG, xlog.c:5182

                  24: [19825-58f473ff.4d71-24] 2017-04-17 07:51:28.156 GMT < > DEBUG:  00000: redo record is at 0/4BC2A998; shutdown FALSE

                  25: [19825-58f473ff.4d71-25] 2017-04-17 07:51:28.156 GMT < > LOCATION:  StartupXLOG, xlog.c:5259

                  26: [19825-58f473ff.4d71-26] 2017-04-17 07:51:28.156 GMT < > DEBUG:  00000: next transaction ID: 0/6252; next OID: 71601

                  27: [19825-58f473ff.4d71-27] 2017-04-17 07:51:28.156 GMT < > LOCATION:  StartupXLOG, xlog.c:526

 

  1. Also, we see the below highlighted logs repeated multiple (count=7984) times during the DB operation. Can you please explain what these logs mean from Postgres perspective?

 

                  46: [19825-58f473ff.4d71-46] 2017-04-17 07:51:28.189 GMT < > DEBUG:  00000: record known xact 6252 latestObservedXid 6251

                  47: [19825-58f473ff.4d71-47] 2017-04-17 07:51:28.189 GMT < > CONTEXT:  xlog redo AccessExclusive locks: xid 6252 db 16386 rel 16899

                  48: [19825-58f473ff.4d71-48] 2017-04-17 07:51:28.189 GMT < > LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2673

                  49: [19825-58f473ff.4d71-49] 2017-04-17 07:51:28.190 GMT < > DEBUG:  00000: record known xact 6252 latestObservedXid 6252

                  50: [19825-58f473ff.4d71-50] 2017-04-17 07:51:28.190 GMT < > CONTEXT:  xlog redo AccessExclusive locks: xid 6252 db 16386 rel 16941

                  51: [19825-58f473ff.4d71-51] 2017-04-17 07:51:28.190 GMT < > LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2673

 

  1. After these repeated messages, crash is observed with the below log message:

 

                  24491: [19825-58f473ff.4d71-24036] 2017-04-17 07:51:30.196 GMT < > CONTEXT:  xlog redo insert: rel 1663/16386/11691; tid 13/266

                  24492: [19825-58f473ff.4d71-24037] 2017-04-17 07:51:30.196 GMT < > LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2673

                  24493: [19825-58f473ff.4d71-24038] 2017-04-17 07:51:30.196 GMT < > DEBUG:  00000: record known xact 6271 latestObservedXid 6271

                  24494: [19825-58f473ff.4d71-24039] 2017-04-17 07:51:30.196 GMT < > CONTEXT:  xlog redo delete: index 1663/16386/11692; iblk 9, heap 1663/16386/11689;

                  24495: [19825-58f473ff.4d71-24040] 2017-04-17 07:51:30.196 GMT < > LOCATION:  RecordKnownAssignedTransactionIds, procarray.c:2673

                  24496: [19825-58f473ff.4d71-24041] 2017-04-17 07:51:30.196 GMT < > PANIC:  XX000: btree_xlog_delete_get_latestRemovedXid: cannot operate with inconsistent data

                  24497: [19825-58f473ff.4d71-24042] 2017-04-17 07:51:30.196 GMT < > CONTEXT:  xlog redo delete: index 1663/16386/11692; iblk 9, heap 1663/16386/11689;

                  24498: [19825-58f473ff.4d71-24043] 2017-04-17 07:51:30.196 GMT < > LOCATION:  btree_xlog_delete_get_latestRemovedXid, nbtxlog.c:634

                  24765: [19810-58f473ff.4d62-191] 2017-04-17 07:51:33.348 GMT < > LOG:  00000: startup process (PID 19825) was terminated by signal 6: Aborted

 

I have attached the logfiles related to PID 19825, which can provide more insight on the observations explained as above.

 

 

Regards,

Sandhya

 

-----Original Message-----
From: K S, Sandhya (Nokia - IN/Bangalore)
Sent: Tuesday, April 25, 2017 8:24 PM
To: Merlin Moncure <mmoncure@gmail.com>
Cc: pgsql-hackers@postgresql.org; tgl@sss.pgh.pa.us; pgsql-bugs@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>
Subject: RE: [BUGS] Crash observed during the start of the Postgres process

 

Hi Merlin,

 

Below is the log captured when the crash was encountered.

 

STATEMENT:  select count(1) from pg_ls_dir(current_setting('data_directory')) where pg_ls_dir = 'backup_label'

LOG:  00000: duration: 4.083 ms

LOCATION:  exec_simple_query, postgres.c:1145

DEBUG:  00000: shmem_exit(0): 7 callbacks to make

LOCATION:  shmem_exit, ipc.c:212

DEBUG:  00000: proc_exit(0): 3 callbacks to make

LOCATION:  proc_exit_prepare, ipc.c:184

DEBUG:  00000: exit(0)

LOCATION:  proc_exit, ipc.c:135

DEBUG:  00000: shmem_exit(-1): 0 callbacks to make

LOCATION:  shmem_exit, ipc.c:212

DEBUG:  00000: proc_exit(-1): 0 callbacks to make

LOCATION:  proc_exit_prepare, ipc.c:184

DEBUG:  00000: reaping dead processes

LOCATION:  reaper, postmaster.c:2669

DEBUG:  00000: server process (PID 11104) exited with exit code 0

LOCATION:  LogChildExit, postmaster.c:3385

DEBUG:  00000: reaping dead processes

LOCATION:  reaper, postmaster.c:2669

LOG:  00000: startup process (PID 10265) was terminated by signal 6: Aborted

LOCATION:  LogChildExit, postmaster.c:3407

LOG:  00000: terminating any other active server processes

LOCATION:  HandleChildCrash, postmaster.c:3134

DEBUG:  00000: sending SIGQUIT to process 10994

LOCATION:  HandleChildCrash, postmaster.c:3233

DEBUG:  00000: sending SIGQUIT to process 10269

LOCATION:  HandleChildCrash, postmaster.c:3263

DEBUG:  00000: sending SIGQUIT to process 10268

LOCATION:  HandleChildCrash, postmaster.c:3275

WARNING:  57P02: terminating connection because of crash of another server process

DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

HINT:  In a moment you should be able to reconnect to the database and repeat your command.

 

 

Backtrace of the core generated:

(gdb) bt

#0  0x0000005563bcf9c0 in raise () from /lib64/libc.so.6

#1  0x0000005563bd42bc in abort () from /lib64/libc.so.6

#2  0x000000012039e228 in errfinish ()

#3  0x000000012039ef08 in elog_finish ()

#4  0x000000012009eb08 in btree_redo ()

#5  0x00000001200caff8 in StartupXLOG ()

#6  0x0000000120259958 in StartupProcessMain ()

#7  0x00000001200d590c in AuxiliaryProcessMain ()

#8  0x0000000120253434 in ?? ()

 

Let me know if any further clarification/information is needed.

 

Regards,

Sandhya

 

-----Original Message-----

From: Merlin Moncure [mailto:mmoncure@gmail.com]

Sent: Tuesday, April 25, 2017 8:01 PM

To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>

Cc: pgsql-hackers@postgresql.org; tgl@sss.pgh.pa.us; pgsql-bugs@postgresql.org; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>

Subject: Re: [BUGS] Crash observed during the start of the Postgres process

 

On Tue, Apr 25, 2017 at 8:44 AM, K S, Sandhya (Nokia - IN/Bangalore)

<sandhya.k_s@nokia.com> wrote:

> Hello,

> Did you get a chance to take a look into the issue?

> Please consider it with high priority. We will be awaiting your inputs.

 

This email is heavily cross posted, which is obnoxious.  Can you paste

the relevant log snippet?

 

merlin

Attachment

Re: [BUGS] Crash observed during the start of the Postgres process

From
Tom Lane
Date:
"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya.k_s@nokia.com> writes:
> I have filtered the logs based on PID (19825) to see if this helps to
> debug the issue further.

Is this really a stock Postgres build?

The proximate cause of the PANIC is that the startup process is seeing
other processes active even though it hasn't reachedConsistency.  This
is bad on any number of levels, quite aside from that particular PANIC,
because those other processes are presumably seeing non-consistent
database state.  Looking elsewhere in the log, we see that indeed there
seem to be several backend processes happily executing commands.
For instance, here's the trace of one of them starting up:

[19810-58f473ff.4d62-187] 2017-04-17 07:51:28.783 GMT < > DEBUG:  00000: forked new backend, pid=19850 socket=10
[19810-58f473ff.4d62-188] 2017-04-17 07:51:28.783 GMT < > LOCATION:  BackendStartup, postmaster.c:3884
[19850-58f47400.4d8a-1] 2017-04-17 07:51:28.783 GMT <  > LOG:  57P03: the database system is starting up
[19850-58f47400.4d8a-2] 2017-04-17 07:51:28.783 GMT <  > LOCATION:  ProcessStartupPacket, postmaster.c:2143
[19850-58f47400.4d8a-3] 2017-04-17 07:51:28.784 GMT <  authentication> DEBUG:  00000: postgres child[19850]: starting
with( 

Now, that LOG message proves that this backend has observed that the
database is not ready to allow connections.  So why did it only emit the
message as LOG and keep going?  The code for this in 9.3 looks like
/* * If we're going to reject the connection due to database state, say so * now instead of wasting cycles on an
authenticationexchange. (This also * allows a pg_ping utility to be written.) */switch (port->canAcceptConnections){
caseCAC_STARTUP:        ereport(FATAL,                (errcode(ERRCODE_CANNOT_CONNECT_NOW),                 errmsg("the
databasesystem is starting up")));        break;... 

I can't draw any other conclusion but that you've hacked something
to make FATAL act like LOG.  Which is a fatal mistake.  Errors that
are marked FATAL are generally ones where allowing the process to
keep going is not an acceptable outcome.
        regards, tom lane



Re: [BUGS] Crash observed during the start of the Postgres process

From
Robert Haas
Date:
On Fri, May 12, 2017 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I can't draw any other conclusion but that you've hacked something
> to make FATAL act like LOG.  Which is a fatal mistake.

I see what you did there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [BUGS] Crash observed during the start of the Postgres process

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:
Hi Tom Lane,

After removing our patch to change FATAL to LOG, we are not observing the crash now.

Thank you for your support. We were struck with this issue for a while.

Regards,
Sandhya

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, May 12, 2017 10:08 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: 'Merlin Moncure' <mmoncure@gmail.com>; 'pgsql-hackers@postgresql.org' <pgsql-hackers@postgresql.org>;
'pgsql-bugs@postgresql.org'<pgsql-bugs@postgresql.org>; Itnal, Prakash (Nokia - IN/Bangalore)
<prakash.itnal@nokia.com>;T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com> 
Subject: Re: [BUGS] Crash observed during the start of the Postgres process

"K S, Sandhya (Nokia - IN/Bangalore)" <sandhya.k_s@nokia.com> writes:
> I have filtered the logs based on PID (19825) to see if this helps to
> debug the issue further.

Is this really a stock Postgres build?

The proximate cause of the PANIC is that the startup process is seeing
other processes active even though it hasn't reachedConsistency.  This
is bad on any number of levels, quite aside from that particular PANIC,
because those other processes are presumably seeing non-consistent
database state.  Looking elsewhere in the log, we see that indeed there
seem to be several backend processes happily executing commands.
For instance, here's the trace of one of them starting up:

[19810-58f473ff.4d62-187] 2017-04-17 07:51:28.783 GMT < > DEBUG:  00000: forked new backend, pid=19850 socket=10
[19810-58f473ff.4d62-188] 2017-04-17 07:51:28.783 GMT < > LOCATION:  BackendStartup, postmaster.c:3884
[19850-58f47400.4d8a-1] 2017-04-17 07:51:28.783 GMT <  > LOG:  57P03: the database system is starting up
[19850-58f47400.4d8a-2] 2017-04-17 07:51:28.783 GMT <  > LOCATION:  ProcessStartupPacket, postmaster.c:2143
[19850-58f47400.4d8a-3] 2017-04-17 07:51:28.784 GMT <  authentication> DEBUG:  00000: postgres child[19850]: starting
with( 

Now, that LOG message proves that this backend has observed that the
database is not ready to allow connections.  So why did it only emit the
message as LOG and keep going?  The code for this in 9.3 looks like
/* * If we're going to reject the connection due to database state, say so * now instead of wasting cycles on an
authenticationexchange. (This also * allows a pg_ping utility to be written.) */switch (port->canAcceptConnections){
caseCAC_STARTUP:        ereport(FATAL,                (errcode(ERRCODE_CANNOT_CONNECT_NOW),                 errmsg("the
databasesystem is starting up")));        break;... 

I can't draw any other conclusion but that you've hacked something
to make FATAL act like LOG.  Which is a fatal mistake.  Errors that
are marked FATAL are generally ones where allowing the process to
keep going is not an acceptable outcome.
        regards, tom lane