Thread: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

[HACKERS] Postgres process invoking exit resulting in sh-QUIT core

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:
Hi,
 
We are using Postgres version 9.3.14 over linux based OS and we are observing sh-QUIT core files randomly when we are restarting the system(occurrence seen once in 30 times).
Backtrace is showing as below:
 
Loaded symbols for /lib64/ld.so.1
Core was generated by `sh -c exit 1'.
Program terminated with signal 3, Quit.
#0  0x0000005559ed78f0 in do_lookup_x () from /lib64/ld.so.1
(gdb) bt
#0  0x0000005559ed78f0 in do_lookup_x () from /lib64/ld.so.1
#1  0x0000005559ed7b88 in _dl_lookup_symbol_x () from /lib64/ld.so.1
#2  0x0000005559ed916c in _dl_relocate_object () from /lib64/ld.so.1
#3  0x0000005559ed0b6c in dl_main () from /lib64/ld.so.1
#4  0x0000005559ee5214 in _dl_sysdep_start () from /lib64/ld.so.1
#5  0x0000005559ece1b0 in _dl_start_final () from /lib64/ld.so.1
#6  0x0000005559ece3f0 in _dl_start () from /lib64/ld.so.1
#7  0x0000005559ecdc10 in __start () from /lib64/ld.so.1
Backtrace stopped: frame did not save the PC
 
When we checked the process listing during the time of core generation, we found Postgres startup process is invoking “sh -c exit 1”:
4518      9249  0.1  0.0 155964  2036 ?        Ss   15:05   0:00 postgres: startup process   waiting for 0000000100000000000002EB
4518     10288  0.0  0.0   3600   508 ?        S    15:11   0:00  \_ sh -c exit 1
 
We tried disabling DB and running the same testcase which didn’t result in core being generated.
Also we are using immediate shutdown mode which uses SIGQUIT.
 
Can you please help us in debugging the issue ? 
 
Regards,
Sandhya
 
 
 

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

From
Craig Ringer
Date:
On 30 June 2017 at 17:41, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya.k_s@nokia.com> wrote:

> When we checked the process listing during the time of core generation, we
> found Postgres startup process is invoking “sh -c exit 1”:
> 4518      9249  0.1  0.0 155964  2036 ?        Ss   15:05   0:00 postgres:
> startup process   waiting for 0000000100000000000002EB

Looks like an archive_command or restore_command .

If 'sh' is dumping core, you probably have issues at a low level in
the kernel, file system, etc. Check dmesg.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Postgres process invoking exit resulting in sh-QUITcore

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:
Hi Craig,

Thanks for the response.

Scenario tried here is restart of the system multiple times. sh-QUIT core is generated when Postgres is invoking the
shellto exit and may not be due to kernel or file system issues. I will try to reproduce the issue with dmesg output
beingprinted.
 

However, is there any instance in Postgres where 'sh -c exit 1' will be invoked?

Regards,
Sandhya

-----Original Message-----
From: Craig Ringer [mailto:craig@2ndquadrant.com] 
Sent: Friday, June 30, 2017 5:40 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: pgsql-hackers@postgresql.org; pgsql-bugs@postgresql.org; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>;
Itnal,Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>
 
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

On 30 June 2017 at 17:41, K S, Sandhya (Nokia - IN/Bangalore)
<sandhya.k_s@nokia.com> wrote:

> When we checked the process listing during the time of core generation, we
> found Postgres startup process is invoking “sh -c exit 1”:
> 4518      9249  0.1  0.0 155964  2036 ?        Ss   15:05   0:00 postgres:
> startup process   waiting for 0000000100000000000002EB

Looks like an archive_command or restore_command .

If 'sh' is dumping core, you probably have issues at a low level in
the kernel, file system, etc. Check dmesg.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

From
Craig Ringer
Date:


On 3 Jul. 2017 23:01, "K S, Sandhya (Nokia - IN/Bangalore)" <sandhya.k_s@nokia.com> wrote:
Hi Craig,

Thanks for the response.

Scenario tried here is restart of the system multiple times. sh-QUIT core is generated when Postgres is invoking the shell to exit and may not be due to kernel or file system issues. I will try to reproduce the issue with dmesg output being printed.

However, is there any instance in Postgres where 'sh -c exit 1' will be invoked?

Most likely it's used directly or indirectly by an archive_commsnd or restore_comand you have configured.

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUITcore

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:

Hi Craig,

 

You were right about the restore_command.

 

We found the below logs:

 

Jul  6 18:39:21.527584 notice CFPU-1 postgres[5154]: [6-1] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly

Jul  6 18:39:21.527584 notice CFPU-1 postgres[5154]: [6-2] This probably means the server terminated abnormally

Jul  6 18:39:21.528072 notice CFPU-1 postgres[5154]: [6-3] before or while processing the request.

Jul  6 18:39:21.528072 notice CFPU-1 postgres[5154]: [6-4]

Jul  6 18:39:21.528072 err CFPU-1 postgres[5154]: [7-1] FATAL:  socket not open

Jul  6 18:39:21.536842 info CFPU-1 postgres[5097]: [5-1] LOG:  received immediate shutdown request

Jul  6 18:39:21.537460 notice CFPU-1 postgres[5198]: [5-1] WARNING:  terminating connection because of crash of another server process

Jul  6 18:39:21.537928 notice CFPU-1 postgres[5198]: [5-2] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Jul  6 18:39:21.538410 notice CFPU-1 postgres[5198]: [5-3] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

Jul  6 18:39:21.539684 notice CFPU-1 postgres[5157]: [5-1] WARNING:  terminating connection because of crash of another server process

Jul  6 18:39:21.539684 notice CFPU-1 postgres[5157]: [5-2] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Jul  6 18:39:21.540016 notice CFPU-1 postgres[5157]: [5-3] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

Jul  6 18:39:21.544922 info CFPU-1 FSPostgresWD: postgres db server exited with error 0

Jul  6 18:39:22.178637 err CFPU-1 postgres[5101]: [20-1] FATAL:  could not restore file "0000000300000002000001CB" from archive: return code 131

 

Immediately after this sh-QUIT core is generated. Searched for return code 131 for restore but didn’t find any helpful information.

Also if we use fast shutdown method issue is not observed but occurring in immediate shutdown method.

 

Collected the output of dmesg but it did not help much. PFA dmesg output when the crash occurred.

Process invoking is happening as below:

4518      5101  0.1  0.0 155968  2056 ?        Ss   18:32   0:00 postgres: startup process   waiting for 0000000300000002000001CB

4518      7919  0.0  0.0   3600   680 ?        S    18:39   0:00  \_ sh -c exit 1

 

Can you help us in debugging the issue further?

 

Regards,
Sandhya

 

From: Craig Ringer [mailto:craig@2ndquadrant.com]
Sent: Wednesday, July 05, 2017 6:20 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: pgsql-bugs <pgsql-bugs@postgresql.org>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>
Subject: RE: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

 

 

 

On 3 Jul. 2017 23:01, "K S, Sandhya (Nokia - IN/Bangalore)" <sandhya.k_s@nokia.com> wrote:

Hi Craig,

Thanks for the response.

Scenario tried here is restart of the system multiple times. sh-QUIT core is generated when Postgres is invoking the shell to exit and may not be due to kernel or file system issues. I will try to reproduce the issue with dmesg output being printed.

However, is there any instance in Postgres where 'sh -c exit 1' will be invoked?

 

Most likely it's used directly or indirectly by an archive_commsnd or restore_comand you have configured.

Attachment

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

From
Craig Ringer
Date:
On 7 July 2017 at 15:10, K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com> wrote:

Hi Craig,

 

You were right about the restore_command.


This all makes sense then.

PostgreSQL sends SIGQUIT for immediate shutdown to its children. So the restore_command would get signalled too.

Can't immediately explain the exit code, and SIGQUIT should _not_ generate a core file. Can you show the result of attaching 'gdb' to the core file and running 'bt full' ?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUITcore

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:

Hi Craig,

 

The scenario is lock and unlock of the system for 30 times. During this scenario 5 sh-QUIT core is generated. GDB of 5 core is pointing to different locations.

I have attached output for 2 such instance.

 

Regards,

Sandhya

 

From: Craig Ringer [mailto:craig@2ndquadrant.com]
Sent: Friday, July 07, 2017 12:55 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: pgsql-bugs <pgsql-bugs@postgresql.org>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

 

On 7 July 2017 at 15:10, K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com> wrote:

Hi Craig,

 

You were right about the restore_command.

 

This all makes sense then.

 

PostgreSQL sends SIGQUIT for immediate shutdown to its children. So the restore_command would get signalled too.

 

Can't immediately explain the exit code, and SIGQUIT should _not_ generate a core file. Can you show the result of attaching 'gdb' to the core file and running 'bt full' ?

 

--

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

From
Craig Ringer
Date:
On 7 July 2017 at 15:41, K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com> wrote:

Hi Craig,

 

The scenario is lock and unlock of the system for 30 times. During this scenario 5 sh-QUIT core is generated. GDB of 5 core is pointing to different locations.

I have attached output for 2 such instance.



You seem to be missing debug symbols. Install appropriate debuginfo packages.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUITcore

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:

Hi Craig,

 

Here is bt after installing all the missing debuginfo packages.

 

(gdb) bt

#0  0x000000fff7682f18 in do_lookup_x (undef_name=undef_name@entry=0xfff75cece5 "_Jv_RegisterClasses", new_hash=new_hash@entry=2681263574,

    old_hash=old_hash@entry=0xffffa159b8, ref=0xfff75ceac8, result=result@entry=0xffffa159a0, scope=<optimized out>, i=1, version=version@entry=0x0,

    flags=flags@entry=1, skip=skip@entry=0x0, type_class=type_class@entry=0, undef_map=undef_map@entry=0xfff76a9478) at dl-lookup.c:444

#1  0x000000fff76839a0 in _dl_lookup_symbol_x (undef_name=0xfff75cece5 "_Jv_RegisterClasses", undef_map=0xfff76a9478, ref=0xffffa15a90,

    symbol_scope=0xfff76a9980, version=0x0, type_class=<optimized out>, flags=<optimized out>, skip_map=0x0) at dl-lookup.c:833

#2  0x000000fff7685730 in elf_machine_got_rel (lazy=1, map=0xfff76a9478) at ../sysdeps/mips/dl-machine.h:870

#3  elf_machine_runtime_setup (profile=<optimized out>, lazy=1, l=0xfff76a9478) at ../sysdeps/mips/dl-machine.h:916

#4  _dl_relocate_object (scope=0xfff76a9980, reloc_mode=<optimized out>, consider_profiling=0) at dl-reloc.c:259

#5  0x000000fff767ba10 in dl_main (phdr=<optimized out>, phdr@entry=0x120000040, phnum=<optimized out>, phnum@entry=8,

    user_entry=user_entry@entry=0xffffa15cf0, auxv=<optimized out>) at rtld.c:2070

#6  0x000000fff7692e3c in _dl_sysdep_start (start_argptr=<optimized out>, dl_main=0xfff7679a98 <dl_main>) at ../elf/dl-sysdep.c:249

#7  0x000000fff767d0d8 in _dl_start_final (arg=arg@entry=0xffffa16410, info=info@entry=0xffffa15d80) at rtld.c:307

#8  0x000000fff767d3d8 in _dl_start (arg=0xffffa16410) at rtld.c:415

#9  0x000000fff7679380 in __start () from /lib64/ld.so.1

 

Please see if this could help in analysing the issue.

 

Regards,

Sandhya

 

From: Craig Ringer [mailto:craig@2ndquadrant.com]
Sent: Friday, July 07, 2017 1:53 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: pgsql-bugs <pgsql-bugs@postgresql.org>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

 

On 7 July 2017 at 15:41, K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com> wrote:

Hi Craig,

 

The scenario is lock and unlock of the system for 30 times. During this scenario 5 sh-QUIT core is generated. GDB of 5 core is pointing to different locations.

I have attached output for 2 such instance.

 

 

You seem to be missing debug symbols. Install appropriate debuginfo packages.

 

--

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] Postgres process invoking exit resulting in sh-QUITcore

From
"K S, Sandhya (Nokia - IN/Bangalore)"
Date:

Hi Craig,

 

While testing for another scenario of continuous postgres server restart, we got many cores of sh-QUIT and along with that we got cores for rm-QUIT. It is pointing to rm of the archive file but we were not able to get the bt as the stack is corrupted.

 

We got below info from gdb:

Core was generated by `rm ./Archive_000000020000000000000118'.

 

And also we were able to get this info:

4518     12490  0.0  0.0  11484  1356 ?        Ss   10:59   0:00 postgres: archiver process   archiving 000000020000000000000118.00000028.backup

4518     12704  2.0  0.0   7672  2932 ?        S    11:00   0:00       \_ sh -c rm ./Archive_*; touch ./Archive_"000000020000000000000118.00000028.backup"; exit 0

4518     12707  0.0  0.0    344     4 ?        S    11:00   0:00                 \_ rm ./Archive_000000020000000000000118

 

In the Postgres configuration file ,we have this information.

archive_command             = 'rm ./Archive_*; touch ./Archive_"%f"; exit 0'

 

So while executing this archive command, core was generated.

You pointed out earlier that issue might be happening during archive command and also all evidence for this crash are pointing to this same command.

Are there any suggestions to recover from this situation or on ways to debug the issue ?

 

Regards,

Sandhya

 

From: K S, Sandhya (Nokia - IN/Bangalore)
Sent: Wednesday, July 12, 2017 4:51 PM
To: 'Craig Ringer' <craig@2ndquadrant.com>
Cc: pgsql-bugs <pgsql-bugs@postgresql.org>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>
Subject: RE: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

 

Hi Craig,

 

Here is bt after installing all the missing debuginfo packages.

 

(gdb) bt

#0  0x000000fff7682f18 in do_lookup_x (undef_name=undef_name@entry=0xfff75cece5 "_Jv_RegisterClasses", new_hash=new_hash@entry=2681263574,

    old_hash=old_hash@entry=0xffffa159b8, ref=0xfff75ceac8, result=result@entry=0xffffa159a0, scope=<optimized out>, i=1, version=version@entry=0x0,

    flags=flags@entry=1, skip=skip@entry=0x0, type_class=type_class@entry=0, undef_map=undef_map@entry=0xfff76a9478) at dl-lookup.c:444

#1  0x000000fff76839a0 in _dl_lookup_symbol_x (undef_name=0xfff75cece5 "_Jv_RegisterClasses", undef_map=0xfff76a9478, ref=0xffffa15a90,

    symbol_scope=0xfff76a9980, version=0x0, type_class=<optimized out>, flags=<optimized out>, skip_map=0x0) at dl-lookup.c:833

#2  0x000000fff7685730 in elf_machine_got_rel (lazy=1, map=0xfff76a9478) at ../sysdeps/mips/dl-machine.h:870

#3  elf_machine_runtime_setup (profile=<optimized out>, lazy=1, l=0xfff76a9478) at ../sysdeps/mips/dl-machine.h:916

#4  _dl_relocate_object (scope=0xfff76a9980, reloc_mode=<optimized out>, consider_profiling=0) at dl-reloc.c:259

#5  0x000000fff767ba10 in dl_main (phdr=<optimized out>, phdr@entry=0x120000040, phnum=<optimized out>, phnum@entry=8,

    user_entry=user_entry@entry=0xffffa15cf0, auxv=<optimized out>) at rtld.c:2070

#6  0x000000fff7692e3c in _dl_sysdep_start (start_argptr=<optimized out>, dl_main=0xfff7679a98 <dl_main>) at ../elf/dl-sysdep.c:249

#7  0x000000fff767d0d8 in _dl_start_final (arg=arg@entry=0xffffa16410, info=info@entry=0xffffa15d80) at rtld.c:307

#8  0x000000fff767d3d8 in _dl_start (arg=0xffffa16410) at rtld.c:415

#9  0x000000fff7679380 in __start () from /lib64/ld.so.1

 

Please see if this could help in analysing the issue.

 

Regards,

Sandhya

 

From: Craig Ringer [mailto:craig@2ndquadrant.com]
Sent: Friday, July 07, 2017 1:53 PM
To: K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com>
Cc: pgsql-bugs <pgsql-bugs@postgresql.org>; PostgreSQL Hackers <pgsql-hackers@postgresql.org>; T, Rasna (Nokia - IN/Bangalore) <rasna.t@nokia.com>; Itnal, Prakash (Nokia - IN/Bangalore) <prakash.itnal@nokia.com>
Subject: Re: [HACKERS] Postgres process invoking exit resulting in sh-QUIT core

 

On 7 July 2017 at 15:41, K S, Sandhya (Nokia - IN/Bangalore) <sandhya.k_s@nokia.com> wrote:

Hi Craig,

 

The scenario is lock and unlock of the system for 30 times. During this scenario 5 sh-QUIT core is generated. GDB of 5 core is pointing to different locations.

I have attached output for 2 such instance.

 

 

You seem to be missing debug symbols. Install appropriate debuginfo packages.

 

--

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services