Thread: Segmentation fault with parallelism PG 10.4
Hi All,
We recently upgraded to Postgresql 10.4 from 9.6.7.
We allow some user querying against the database and a user wrote sql that would repeatedly cause a segmentation fault. We can’t share the data but I will show what I can.
Here is an example from the postgres logs with the sql simplified but still throwing the error. It is not very well written but we can’t have a bad query bring down the instance.
<2018-06-12 08:59:28 PDT [9109]: [38-1] db=, user=, host=> LOG: server process (PID 11471) was terminated by signal 11: Segmentation fault
<2018-06-12 08:59:28 PDT [9109]: [39-1] db=, user=, host=> DETAIL: Failed process was running: SELECT p.vital_status, c.addr_at_dx_street_name
FROM ctc c
INNER JOIN patient p ON c.pat_id = p.pat_id
LEFT JOIN ctc_registry cr ON c.ctc_id = cr.ctc_id
LEFT JOIN facility_admission fa ON fa.ctc_id = cr.ctc_id AND fa.fac_id = cr.follow_up_hospital
LEFT JOIN patient_reference_id pr ON pr.pat_id = p.pat_id AND pr.fac_id = cr.follow_up_hospital
WHERE
c.primary_site between 'C500' and 'C509'
AND c.date_of_diagnosis_mm between case when c.date_of_diagnosis_yyyy = 2006 then 7 else 1 end and case when c.date_of_diagnosis_yyyy = 2015 then 6 else 12 end
AND (c.sequence_number = '00'
OR (c.sequence_number = '02'
AND p.pat_id in (select pat_id
from ctc
where behavior_icdo3 = 2
and sequence_number = '01'
and deleted = 0
and pat_id in (select pat_id
from ctc
where deleted = 0)))
)
;
We changed max_parallel_workers_per_gather to 0 from the default of 2 and the sql runs fine. I attached the explain plans with parallel turned on and off to see if that helps.
We would obviously like to be able to run the system with parallel turned on.
Thanks,
Manfred
Information in this e-mail may be confidential. It is intended only for the addressee(s) identified above. If you are not the addressee(s), or an employee or agent of the addressee(s), please note that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender of the error.
Attachment
"Heinemann, Manfred (IMS)" <HeinemannM@imsweb.com> writes: > We allow some user querying against the database and a user wrote sql that would repeatedly cause a segmentation fault.We can't share the data but I will show what I can. The info you've provided is not of much help, but if you could show a stack trace from the point of the segfault, that might narrow things down. https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane
This is what I see in a stack trace Program received signal SIGSEGV, Segmentation fault. 0x000000000084dcf3 in pglz_decompress () -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Tuesday, June 12, 2018 1:45 PM To: Heinemann, Manfred (IMS) <HeinemannM@imsweb.com> Cc: pgsql-admin@lists.postgresql.org Subject: Re: Segmentation fault with parallelism PG 10.4 "Heinemann, Manfred (IMS)" <HeinemannM@imsweb.com> writes: > We allow some user querying against the database and a user wrote sql that would repeatedly cause a segmentation fault.We can't share the data but I will show what I can. The info you've provided is not of much help, but if you could show a stack trace from the point of the segfault, that mightnarrow things down. https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane ________________________________ Information in this e-mail may be confidential. It is intended only for the addressee(s) identified above. If you are notthe addressee(s), or an employee or agent of the addressee(s), please note that any dissemination, distribution, or copyingof this communication is strictly prohibited. If you have received this e-mail in error, please notify the senderof the error.
Is this stack trace helpful at all? Is there anything else I could do to investigate? Thanks, Manfred >This is what I see in a stack trace > >Program received signal SIGSEGV, Segmentation fault. >0x000000000084dcf3 in pglz_decompress () > >>-----Original Message----- >>From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >>Sent: Tuesday, June 12, 2018 1:45 PM >>To: Heinemann, Manfred (IMS) <HeinemannM@imsweb.com> >>Cc: pgsql-admin@lists.postgresql.org >>Subject: Re: Segmentation fault with parallelism PG 10.4 >> >>"Heinemann, Manfred (IMS)" <HeinemannM@imsweb.com> writes: >>> We allow some user querying against the database and a user wrote sql that would repeatedly cause a segmentation fault.We can't share the data but I will show what I can. >> >>The info you've provided is not of much help, but if you could show a stack trace from the point of the segfault, thatmight narrow things down. >> >>https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend >> >>regards, tom lane ________________________________ Information in this e-mail may be confidential. It is intended only for the addressee(s) identified above. If you are notthe addressee(s), or an employee or agent of the addressee(s), please note that any dissemination, distribution, or copyingof this communication is strictly prohibited. If you have received this e-mail in error, please notify the senderof the error. ________________________________ Information in this e-mail may be confidential. It is intended only for the addressee(s) identified above. If you are notthe addressee(s), or an employee or agent of the addressee(s), please note that any dissemination, distribution, or copyingof this communication is strictly prohibited. If you have received this e-mail in error, please notify the senderof the error.
"Heinemann, Manfred (IMS)" <HeinemannM@imsweb.com> writes: > Is this stack trace helpful at all? No, because it isn't actually a stack trace, just the top frame. See https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane
Here is what I get for the stack trace, is it more helpful? Program received signal SIGSEGV, Segmentation fault. 0x000000000084dcb0 in pglz_decompress () #0 0x000000000084dcb0 in pglz_decompress () #1 0x00000000004b8d2f in toast_decompress_datum () #2 0x0000000000793ce0 in numeric_eq () #3 0x00000000005e68eb in ExecInterpExpr () #4 0x00000000005ef589 in ExecScan () #5 0x0000000000608784 in ExecNestLoop () #6 0x00000000005fc6b5 in ExecGather () #7 0x00000000005feb35 in ExecHashJoin () #8 0x00000000006086d4 in ExecNestLoop () #9 0x00000000005e9b02 in standard_ExecutorRun () #10 0x000000000071305b in PortalRunSelect () #11 0x00000000007142d1 in PortalRun () #12 0x000000000071038b in exec_simple_query () #13 0x0000000000711589 in PostgresMain () #14 0x0000000000476399 in ServerLoop () #15 0x00000000006a9acc in PostmasterMain () #16 0x000000000062b58c in main () Thanks, Manfred >"Heinemann, Manfred (IMS)" <HeinemannM@imsweb.com> writes: >> Is this stack trace helpful at all? > >No, because it isn't actually a stack trace, just the top frame. See > >https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend > >regards, tom lane ________________________________ Information in this e-mail may be confidential. It is intended only for the addressee(s) identified above. If you are notthe addressee(s), or an employee or agent of the addressee(s), please note that any dissemination, distribution, or copyingof this communication is strictly prohibited. If you have received this e-mail in error, please notify the senderof the error.
"Heinemann, Manfred (IMS)" <HeinemannM@imsweb.com> writes: > Here is what I get for the stack trace, is it more helpful? > Program received signal SIGSEGV, Segmentation fault. > 0x000000000084dcb0 in pglz_decompress () > #0 0x000000000084dcb0 in pglz_decompress () > #1 0x00000000004b8d2f in toast_decompress_datum () > #2 0x0000000000793ce0 in numeric_eq () > #3 0x00000000005e68eb in ExecInterpExpr () > #4 0x00000000005ef589 in ExecScan () > #5 0x0000000000608784 in ExecNestLoop () > #6 0x00000000005fc6b5 in ExecGather () > #7 0x00000000005feb35 in ExecHashJoin () > #8 0x00000000006086d4 in ExecNestLoop () > #9 0x00000000005e9b02 in standard_ExecutorRun () > #10 0x000000000071305b in PortalRunSelect () > #11 0x00000000007142d1 in PortalRun () > #12 0x000000000071038b in exec_simple_query () > #13 0x0000000000711589 in PostgresMain () > #14 0x0000000000476399 in ServerLoop () > #15 0x00000000006a9acc in PostmasterMain () > #16 0x000000000062b58c in main () Hmm. So (1) apparently we have some corrupt data in a numeric column, and (2) it looks like the crash is happening in a parallel leader process, not a worker process, which makes it pretty mystifying why it's connected to parallelism at all. But we're not going to be able to get far with that amount of info. Is there any chance of extracting a self-contained test case? regards, tom lane