Thread: Troubleshooting a segfault and instance crash

Troubleshooting a segfault and instance crash

From

Blair Boadway

Date:

08 March 2018, 23:40:09

Hello,

We’re seeing an occasional segfault on a particular database

Mar 7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip 000000302f32868a sp 00007ffcf1547498 error 4 in libc-2.12.so[302f200000+18a000]

Mar 7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG: server process (PID 29351) was terminated by signal 11: Segmentation fault

It crashes the database, though it restarts on its own without any apparent issues. This has happened 3 times in 2 months, and each time the segfault error and memory address are the same. We’ve only seen it on one database, though we’ve seen it on both hosts of a primary/standby setup: we switched the primary over to the other host and got a segfault there, which seems to rule out a hardware issue. Oddly, the database has no issues with normal DML workloads (it is a moderately busy prod OLTP system), but the segfault has happened very shortly after DDL changes were made. Most recently it happened while running a series of grants for new db users we were deploying (i.e. running a SQL script from psql on the primary host):

grant usage on schema app to app_user1;

grant usage on schema app to app_user2;

...

Our set up is

RHEL 6.9 - 2.6.32-696.16.1.el6.x86_64

PostgreSQL 9.6.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit

Extensions - pg_cron,repmgr_funcs,pgaudit,pg_stat_statements,pg_hint_plan,pglogical

So far we can’t reproduce it on a test system. We have just added some OS config to collect a core file but haven’t collected one yet. There isn’t any particular config change or extension that we can link to the problem; this system has run for months without problems since the last config changes. We’d appreciate any ideas.

Regards,

Blair
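
Editorial note: the kernel log line in the message above can be partly decoded by hand. The sketch below assumes an x86-64 Linux host and uses two made-up helper names, lib_offset and decode_pf_error. It subtracts the libc mapping base from the instruction pointer to get the offset of the faulting instruction inside libc, and decodes the low three bits of the page-fault error code.

```shell
#!/bin/sh
# Sketch: decode two fields of a kernel "segfault at ..." log line.
# Both function names are hypothetical, chosen for this example.

# Offset of the faulting instruction inside the mapped library:
# instruction pointer minus mapping base (both hex, without "0x").
lib_offset() {
    printf '0x%x\n' "$(( 0x$1 - 0x$2 ))"
}

# Decode the low three bits of the x86 page-fault error code.
decode_pf_error() {
    code=$1
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user mode"; else mode="kernel mode"; fi
    echo "$mode $access, $cause"
}

lib_offset 000000302f32868a 302f200000   # ip and base from the log line
decode_pf_error 4                        # "error 4" from the same line
```

Read this way, "segfault at 0 ... error 4" describes a user-mode read of address 0, which is what a NULL pointer dereference inside a libc function typically looks like. This narrows the symptom but does not by itself identify the caller.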

Re: Troubleshooting a segfault and instance crash

From

Pavel Stehule

Date:

08 March 2018, 23:47:43

Can you get a core dump? Maybe it is a pgaudit bug? It is a complex extension.

Regards

Pavel


Re: Troubleshooting a segfault and instance crash

From

Blair Boadway

Date:

09 March 2018, 00:16:01

Hi Pavel,

I don’t have a core yet; the only way to get one right now is to intentionally crash the prod system a couple of times. We haven’t resorted to that yet.

Interesting that you mention pgaudit. It is installed on this system because it is part of our standard installation, but on this particular system we haven’t yet needed audits, so the audit role is ‘empty’. (And on a different system with the same installation and heavy use of auditing we’ve seen no segfaults.)

On this system

pgaudit.role = 'auditor'

pgaudit.log_parameter = off

pgaudit.log_catalog = off

pgaudit.log_statement_once = on

pgaudit.log_level = log

select * from information_schema.role_table_grants where grantee = 'auditor';

(0 rows)

thanks, Blair


Re: Troubleshooting a segfault and instance crash

From

Pavel Stehule

Date:

09 March 2018, 00:34:08

2018-03-08 19:16 GMT+01:00 Blair Boadway <bboadway@abebooks.com>:

> I don’t have a core yet, the only way I have now is to intentionally crash the prod system a couple of times. Haven’t resorted to that yet.

It is hard to help without a backtrace, and for that you need a core dump.

> Interesting you mentioned pgaudit: it is installed on this system because that is our standard installation, but on this particular system we haven’t yet needed audits so the audit role is ‘empty’. (And on a different system with the same installation and heavy use of audit we’ve seen no segfaults)

The other extensions are simple, unrelated to DDL, or well known. So pgaudit is the best candidate, but the error can be anywhere.

Regards

Pavel


RE: Troubleshooting a segfault and instance crash

From

Jan Bilek

Date:

09 March 2018, 04:55:29

Hi Blair, Pavel,

We are using the procedure described in https://access.redhat.com/solutions/4896 to automate crash detail collection for our production systems on RHEL 7.

Perhaps something like this can help on your side.

Kind Regards,
Jan
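
Editorial note: before waiting for the next crash, it can help to confirm that the environment would produce a core file at all. The following is a minimal sketch, not the Red Hat procedure linked above; core_limit_ok is a made-up helper name, and the check only reflects the shell it runs in.

```shell
#!/bin/sh
# Sketch: report whether the current shell would allow core files.
# core_limit_ok is a hypothetical helper name for this example.

# Succeeds when a `ulimit -c` value permits core files
# ("unlimited" or a positive number).
core_limit_ok() {
    case "$1" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
        *) [ "$1" -gt 0 ] ;;
    esac
}

limit=$(ulimit -c)
if core_limit_ok "$limit"; then
    echo "core files allowed (ulimit -c = $limit)"
else
    echo "core files disabled (ulimit -c = $limit)"
fi

# Where the kernel writes cores (Linux only). A leading "|" means cores
# are piped to a helper program instead of being written to a file.
if [ -r /proc/sys/kernel/core_pattern ]; then
    echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
fi
```

The limit that matters is the one inherited by the postmaster, so the check is only meaningful when run in the same environment that starts the server. pg_ctl also has a -c (--core-files) option that attempts to lift the limit at startup.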


Re: Troubleshooting a segfault and instance crash

From

Blair Boadway

Date:

25 March 2018, 04:44:14

Following up on this thread: we removed pgaudit from the system to eliminate one variable (removed it from postgresql.conf, including shared_preload_libraries), but after a couple of weeks of success we hit the segfault again. Again it happened while running some DDL (object grants). This time we were configured to harvest a core file, which gave us a small bit of info:

gdb -q -c core /usr/pgsql-9.6/bin/postgres

Reading symbols from /usr/pgsql-9.6/bin/postgres...(no debugging symbols found)...done.

Core was generated by `postgres: batch_user_account'.

Program terminated with signal 11, Segmentation fault.

#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install postgresql96-server-9.6.5-1PGDG.rhel6.x86_64

That wasn’t really enough information to tell me what the problem is. I did not have success installing debuginfo:

Could not find debuginfo for main pkg: postgresql96-server-9.6.5-1PGDG.rhel6.x86_64

I’m not sure how useful it would be to dig further into that. So it doesn’t seem that pgaudit is the culprit, but I’m not sure what to make of the strcmp error.

-Blair


Re: Troubleshooting a segfault and instance crash

From

Peter Geoghegan

Date:

25 March 2018, 05:17:43

On Thu, Mar 8, 2018 at 9:40 AM, Blair Boadway <bboadway@abebooks.com> wrote:
> Mar  7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip
> 000000302f32868a sp 00007ffcf1547498 error 4 in
> libc-2.12.so[302f200000+18a000]
>
> Mar  7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG:
> server process (PID 29351) was terminated by signal 11: Segmentation fault

> It crashes the database, though it starts again on its own without any
> apparent issues.  This has happened 3 times in 2 months and each time the
> segfault error and memory address is the same.

We had a recent report of a segfault on a Redhat compatible system,
that seemed like it might originate from within its glibc [1].
Although all the versions there didn't match what you have, it's worth
considering as a possibility.

Maybe you can't install debuginfo packages because you don't yet have
the necessary debuginfo repos set up. Just a guess. That is sometimes
a required extra step.

[1] https://postgr.es/m/7369.1520528405@sss.pgh.pa.us
--
Peter Geoghegan

Re: Troubleshooting a segfault and instance crash

From

Blair Boadway

Date:

25 March 2018, 05:41:53

Thanks for the tip. We are using RHEL 6.9 and definitely up to date on glibc (2.12-1.209.el6_9.2). We also have the same versions on a very similar system with no segfault.

My colleague got a better backtrace that shows another extension:

Core was generated by `postgres: batch_user_account'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install postgresql96-server-9.6.5-1PGDG.rhel6.x86_64
(gdb) bt
#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007fa3f0c7074c in get_query_string (pstate=<value optimized out>, query=<value optimized out>, jumblequery=<value optimized out>) at pg_hint_plan.c:1882
#2 0x00007fa3f0c70a5d in pg_hint_plan_post_parse_analyze (pstate=0x25324b8, query=0x25325e8) at pg_hint_plan.c:2875
#3 0x00000000005203bc in parse_analyze ()
#4 0x00000000006df933 in pg_analyze_and_rewrite ()
#5 0x00000000007c6f6b in ?? ()
#6 0x00000000007c6ff0 in CachedPlanGetTargetList ()
#7 0x00000000006e173a in PostgresMain ()
#8 0x00000000006812f5 in PostmasterMain ()
#9 0x0000000000609278 in main ()

We aren’t sure whether this indicates that pg_hint_plan is causing the segfault or whether it just happened to be running when the segfault occurred. We aren’t actually using pg_hint_plan hints on this system, so we’re not sure how all this relates to a segfault that occurs when another process runs ‘grant usage on schema abc to user xyz;’, unrelated to the account that segfaults.

Short of better ideas, we will pull the pg_hint_plan extension and see if that removes the problem.

-Blair
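
Editorial note: when a crash lands inside libc, the frame of interest is usually the first caller outside libc. The sketch below filters a saved backtrace for that frame; first_caller_frame is a made-up name, and the input is frames #0 to #2 copied from the backtrace in the message above.

```shell
#!/bin/sh
# Sketch: print the first stack frame that is not inside libc, i.e. the
# most recent caller of the libc function that crashed.
# first_caller_frame is a hypothetical helper name for this example.
first_caller_frame() {
    awk '/^#[0-9]+ / && !/libc\.so/ { print; exit }'
}

# Frames #0 to #2 from the backtrace in the message above.
first_caller_frame <<'EOF'
#0 0x000000386712868a in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007fa3f0c7074c in get_query_string (pstate=<value optimized out>, query=<value optimized out>, jumblequery=<value optimized out>) at pg_hint_plan.c:1882
#2 0x00007fa3f0c70a5d in pg_hint_plan_post_parse_analyze (pstate=0x25324b8, query=0x25325e8) at pg_hint_plan.c:2875
EOF
```

For this backtrace the filter prints frame #1, get_query_string at pg_hint_plan.c:1882. Together with the "segfault at 0" in the kernel log, that would be consistent with a NULL pointer being passed to strcmp from that function, though the backtrace alone does not confirm it.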


RE: Troubleshooting a segfault and instance crash

From

Jan Bilek

Date:

25 March 2018, 07:12:40

Hi Blair,

Regarding that debug package, I found it here: http://cbs.centos.org/koji/buildinfo?buildID=20425 , see http://cbs.centos.org/kojifiles/packages/rh-postgresql96-postgresql/9.6.5/1.el7/x86_64/rh-postgresql96-postgresql-debuginfo-9.6.5-1.el7.x86_64.rpm

However, I have too little experience with it to provide more instructions.

There is also the option of getting your own debugging symbols by building the postgresql server with them and then stripping them out. I've found a pretty good example here: http://marcioandreyoliveira.blogspot.com.au/2008/03/how-to-debug-striped-programs-with-gdb.html

Finally, looking at that segfault: a crash in strcmp is a pretty common and nasty error. Most probably it is a buffer overflow, to which RHEL is more sensitive than other OSes. This is where postgresql pays a toll for being written in C and not e.g. C++. It can be practically anywhere.

Anyway, once those debugging symbols are linked to your core dump, we should immediately see where it is, and you'll be doing the community a great help. I'm sure that Pavel will then be able to issue a fix in a matter of minutes ;)

Kind Regards,
Jan

--
Jan Bilek

CTO, EFTLab

M: +61 (0) 498 103 179
E: jan.bilek@eftlab.com.au
A: 109 Brighton Road, Sandgate, QLD 4017


Re: Troubleshooting a segfault and instance crash

From

Pavel Stehule

Date:

25 March 2018, 10:18:11

2018-03-25 0:41 GMT+01:00 Blair Boadway <bboadway@abebooks.com>:


> We aren’t sure if this indicates that pg_hint_plan is causing the segfault or if it happened to be doing something when the segfault occurred. We aren’t actually using pg_hint_plan hints in this system so we’re not sure how all this relates to segfault when another process does a ‘grant usage on schema abc to user xyz;’ unrelated to the account segfaulting.

Although you don't use pg_hint_plan explicitly, pg_hint_plan is active: it is invoked via planner callbacks.

> Short of better ideas, we will pull the pg_hint_plan extension and see if that removes the problem.

Please report this backtrace to the pg_hint_plan authors.

Regards

Pavel

-Blair

From: Peter Geoghegan <pg@bowt.ie>
Date: Saturday, March 24, 2018 at 4:18 PM
To: Blair Boadway <bboadway@abebooks.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Subject: Re: Troubleshooting a segfault and instance crash

On Thu, Mar 8, 2018 at 9:40 AM, Blair Boadway <bboadway@abebooks.com> wrote:
Mar  7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 ip 000000302f32868a sp 00007ffcf1547498 error 4 in libc-2.12.so[302f200000+18a000]

Mar  7 14:46:35 pgprod2 POSTGRES[21262]: [5] user=,db=,app=client= LOG: server process (PID 29351) was terminated by signal 11: Segmentation fault

It crashes the database, though it starts again on its own without any
apparent issues.  This has happened 3 times in 2 months and each time the
segfault error and memory address are the same.
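
The fields in that kernel line can be decoded mechanically. The sketch below is an illustration in Python, not something run in this thread; the function name `decode_segfault` and the regular expression are assumptions. It relies on two facts: the kernel prints the faulting address, instruction pointer, and mapping base in hex, and the x86 page-fault error code uses bit 0 for "page was present", bit 1 for "write access", and bit 2 for "user mode".

```python
import re

# Matches the x86 kernel message "segfault at A ip I sp S error E in OBJ[BASE+SIZE]".
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

# Log line copied from the thread.
LINE = (
    "Mar 7 14:46:35 pgprod2 kernel:postgres[29351]: segfault at 0 "
    "ip 000000302f32868a sp 00007ffcf1547498 error 4 in "
    "libc-2.12.so[302f200000+18a000]"
)


def decode_segfault(line):
    """Decode a kernel segfault line into its fields, or return None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "offset_in_object": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),  # bit 0: protection fault on a present page
        "write": bool(err & 2),         # bit 1: the access was a write
        "user_mode": bool(err & 4),     # bit 2: the fault happened in user mode
    }


if __name__ == "__main__":
    info = decode_segfault(LINE)
    print(hex(info["offset_in_object"]), info)
```

For the line above this gives a user-mode read of address 0 on a page that is not mapped, at offset 0x12868a inside libc-2.12.so, which is what a NULL pointer passed to a libc string routine would look like. The low bits of the gdb frame address 0x000000386712868a are also 0x12868a, which would be consistent with the same libc function if libc was mapped at 0x3867000000 in that process; the load address is not shown in the thread, so that part is an assumption.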

We had a recent report of a segfault on a Red Hat-compatible system
that seemed like it might originate from within its glibc [1].
Although all the versions there didn't match what you have, it's worth
considering as a possibility.

Maybe you can't install debuginfo packages because you don't yet have
the necessary debuginfo repos set up. Just a guess. That is sometimes
a required extra step.

[1] https://postgr.es/m/7369.1520528405@sss.pgh.pa.us
--
Peter Geoghegan

Re: Troubleshooting a segfault and instance crash

From

Blair Boadway

Date:

28 March 2018, 04:47:54

As a follow-up, we’ve been able to get the same backtrace implicating pg_hint_plan from 2 separate crashes. We were using pg_hint_plan 1.2.2 and have reported the issue on the pg_hint_plan GitHub project. We’ve removed pg_hint_plan, and it looks like the system will no longer segfault under the same conditions. This strongly suggests pg_hint_plan was the root cause of our issue, but we can’t yet be 100% certain because the issue was always transient.

-Blair
