Re: BUG #19416: Backend SIGSEGV in ExecShutdownHashJoin/ExecHashTableDetach/dsa_free - Mailing list pgsql-bugs
| From | Chris Hofstaedtler |
|---|---|
| Subject | Re: BUG #19416: Backend SIGSEGV in ExecShutdownHashJoin/ExecHashTableDetach/dsa_free |
| Date | |
| Msg-id | adTAq--RZYaHjxP9@zeha.at |
| In response to | Re: BUG #19416: Backend SIGSEGV in ExecShutdownHashJoin/ExecHashTableDetach/dsa_free (Thomas Munro <thomas.munro@gmail.com>) |
| Responses | Re: BUG #19416: Backend SIGSEGV in ExecShutdownHashJoin/ExecHashTableDetach/dsa_free |
| List | pgsql-bugs |
* Thomas Munro <thomas.munro@gmail.com> [260328 08:03]:
> On Thu, Feb 26, 2026 at 11:28 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > #0  0x0000564830c3c9d1 in dsa_free (area=0x5648621a9280, dp=<optimized out>)
> >     at ./build/../src/backend/utils/mmgr/dsa.c:845
> > 845     ./build/../src/backend/utils/mmgr/dsa.c: No such file or directory.
> > (gdb) bt full
> > #0  0x0000564830c3c9d1 in dsa_free (area=0x5648621a9280, dp=<optimized out>)
> >     at ./build/../src/backend/utils/mmgr/dsa.c:845
> >         segment_map = 0x5648621a92b8
> >         pageno = 0
> >         span_pointer = 0
> >         span = <optimized out>
> >         superblock = <optimized out>
> >         object = <optimized out>
> >         size = <optimized out>
> >         size_class = <optimized out>
> > #1  0x0000564830917ae1 in ExecHashTableDetach (hashtable=0x564862147fa0)
> >     at ./build/../src/backend/executor/nodeHash.c:3439
> >         i = <optimized out>
> >         pstate = 0x7ef9970b5040
>
> Hmm, it's not supposed to be possible for two backends to reach the
> dsa_free() at nodeHash.c:3439. But it does smell a bit like that's
> what happened... hmmm. We were in _RUN phase, and then this barrier
> should only let one process through to _FREE. I'd be interested in
> the contents of pstate->build_barrier in frame 1 if you get another
> core file with the same stack in it.

Sorry for the lack of followup. We currently think our problem matches
the bug described in
https://postgr.es/m/044401dcabac$fe432490$fac96db0$@icloud.com
("[BUG + PATCH] DSA pagemap out-of-bounds in make_new_segment odd-sized path").

> Can you share the outline of the query plan? Does it happen to have a
> full/right PHJ in it?

It appears we forgot to capture more of the logs (before they expired).
Here is a plan from a smaller machine carrying a similar dataset.
Unfortunately I cannot guarantee it is the same plan, although it seems
close. Indeed it has a "Parallel Hash Right Join" node.
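(Should we catch another core file, a sketch of how the barrier state Thomas asked about could be pulled out — the binary path is hypothetical and the frame numbering follows the backtrace above, so both may differ:)

```gdb
# Hypothetical gdb session against a fresh core file; adjust the
# postgres binary path and frame number to match the actual crash.
$ gdb /usr/lib/postgresql/18/bin/postgres corefile
(gdb) frame 1                       # ExecHashTableDetach, per the backtrace
(gdb) print *pstate                 # full ParallelHashJoinState
(gdb) print pstate->build_barrier   # phase / participants / arrived counts
```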
Nested Loop  (cost=13030498.63..46923224.94 rows=21144 width=36)
  ->  Nested Loop  (cost=13030498.19..46911677.41 rows=21144 width=44)
        ->  GroupAggregate  (cost=13030497.75..46808339.62 rows=54669 width=2120)
              Group Key: subscriber.id
              Filter: (last_agg(nsr.voicemail_present ORDER BY nsr.sort) = 'true'::text)
              ->  Gather Merge  (cost=13030497.75..19559670.85 rows=53993659 width=23)
                    Workers Planned: 5
                    ->  Sort  (cost=13029497.68..13056494.51 rows=10798732 width=23)
                          Sort Key: subscriber.id, nsr.sort
                          ->  Parallel Hash Left Join  (cost=10633766.78..11767970.44 rows=10798732 width=23)
                                Hash Cond: (subscriber.id = subscription.subscriber_id)
                                Filter: ((nsr.postpaid IS NULL) OR (nsr.postpaid = (subscription.id IS NOT NULL)))
                                ->  Hash Left Join  (cost=10585690.41..11619799.44 rows=11184278 width=24)
                                      Hash Cond: (subscriber.id = subscriber_options.subscriber_id)
                                      Filter: ((nsr.option_ids IS NULL) OR (nsr.option_ids <@ subscriber_options.option_ids))
                                      ->  Parallel Hash Right Join  (cost=9142123.87..10141640.84 rows=13177920 width=49)
                                            Hash Cond: ((optionvalue_lso.subscriber_id = subscriber.id) AND (optionvalue_lso.option_id = nsr.optionvalue_option_id) AND (optionvalue_lso.value = nsr.optionvalue_value))
                                            Filter: ((nsr.optionvalue_option_id IS NULL) OR (optionvalue_lso.id IS NOT NULL))
                                            ->  Parallel Seq Scan on lnk_subscriber_option optionvalue_lso  (cost=0.00..728660.20 rows=1901347 width=21)
                                                  Filter: ((activationdate <= now()) AND (deletiondate > now()))
                                            ->  Parallel Hash  (cost=8769950.27..8769950.27 rows=13177920 width=62)
                                                  ->  Parallel Hash Join  (cost=2888653.49..8769950.27 rows=13177920 width=62)
                                                        Hash Cond: (lnk_subscriber_simcard.subscriber_id = subscriber.id)
                                                        Join Filter: (((nsr.brand_id IS NULL) OR (nsr.brand_id = subscriber.brand_id)) AND ((nsr.registered IS NULL) OR (nsr.registered = subscriber.registered)) AND ((nsr.customer_category_ids IS NULL) OR (basedata.customercategory_id = ANY (nsr.customer_category_ids))) AND ((nsr.subscriberstatus_ids IS NULL) OR (lnk_subscriber_status.subscriberstatus_id = ANY (nsr.subscriberstatus_ids))))
                                                        ->  Nested Loop  (cost=1148075.17..6707980.64 rows=34348223 width=126)
                                                              Join Filter: ((simcard.imsi >= nsr.imsi_start) AND (simcard.imsi <= nsr.imsi_end))
                                                              ->  Parallel Hash Join  (cost=1148075.17..1298129.01 rows=1807801 width=12)
                                                                    Hash Cond: (lnk_subscriber_simcard.simcard_id = simcard.id)
                                                                    ->  Parallel Seq Scan on lnk_subscriber_simcard  (cost=0.00..145308.36 rows=1807801 width=8)
                                                                          Filter: ((activationdate <= now()) AND (deletiondate > now()))
                                                                    ->  Parallel Hash  (cost=1118684.52..1118684.52 rows=2351252 width=12)
                                                                          ->  Parallel Seq Scan on simcard  (cost=0.00..1118684.52 rows=2351252 width=12)
                                                              ->  Materialize  (cost=0.00..7.56 rows=171 width=138)
                                                                    ->  Seq Scan on network_settings_rule nsr  (cost=0.00..6.71 rows=171 width=138)
                                                        ->  Parallel Hash  (cost=1717950.74..1717950.74 rows=1810207 width=21)
                                                              ->  Parallel Hash Left Join  (cost=778201.62..1717950.74 rows=1810207 width=21)
                                                                    Hash Cond: (subscriber.id = lnk_subscriber_basedata.subscriber_id)
                                                                    ->  Parallel Hash Join  (cost=342951.95..1270264.89 rows=1810207 width=17)
                                                                          Hash Cond: (lnk_subscriber_status.subscriber_id = subscriber.id)
                                                                          ->  Parallel Bitmap Heap Scan on lnk_subscriber_status  (cost=102927.10..1025488.25 rows=1810207 width=8)
                                                                                Recheck Cond: (deletiondate > now())
                                                                                Filter: (activationdate <= now())
                                                                                ->  Bitmap Index Scan on idx_subscriber_status_deletiondate  (cost=0.00..100664.35 rows=9051037 width=0)
                                                                                      Index Cond: (deletiondate > now())
                                                                          ->  Parallel Hash  (cost=216653.82..216653.82 rows=1869682 width=9)
                                                                                ->  Parallel Seq Scan on subscriber  (cost=0.00..216653.82 rows=1869682 width=9)
                                                                    ->  Parallel Hash  (cost=413397.43..413397.43 rows=1748179 width=12)
                                                                          ->  Parallel Hash Left Join  (cost=228746.63..413397.43 rows=1748179 width=12)
                                                                                Hash Cond: (lnk_subscriber_basedata.basedata_id = basedata.id)
                                                                                ->  Parallel Seq Scan on lnk_subscriber_basedata  (cost=0.00..180061.84 rows=1748179 width=12)
                                                                                      Filter: ((now() >= activationdate) AND (now() < deletiondate))
                                                                                ->  Parallel Hash  (cost=201487.39..201487.39 rows=2180739 width=8)
                                                                                      ->  Parallel Seq Scan on basedata  (cost=0.00..201487.39 rows=2180739 width=8)
                                      ->  Hash  (cost=1434914.41..1434914.41 rows=692170 width=36)
                                            ->  Subquery Scan on subscriber_options  (cost=1345069.22..1434914.41 rows=692170 width=36)
                                                  ->  HashAggregate  (cost=1345069.22..1427992.71 rows=692170 width=36)
                                                        Group Key: lnk_subscriber_option.subscriber_id
                                                        Planned Partitions: 8
                                                        ->  Index Scan using idx_option_subscriber_combined on lnk_subscriber_option  (cost=0.57..1033129.48 rows=9506735 width=8)
                                                              Index Cond: ((activationdate <= now()) AND (deletiondate > now()))
                                ->  Parallel Hash  (cost=46189.54..46189.54 rows=150946 width=8)
                                      ->  Parallel Hash Join  (cost=14649.64..46189.54 rows=150946 width=8)
                                            Hash Cond: (lnk_subscription_status.subscription_id = subscription.id)
                                            ->  Hash Join  (cost=1.27..31144.93 rows=150946 width=4)
                                                  Hash Cond: (lnk_subscription_status.subscriptionstatus_id = subscriptionstatus.id)
                                                  ->  Parallel Seq Scan on lnk_subscription_status  (cost=0.00..30421.01 rows=211324 width=8)
                                                        Filter: ((activationdate <= now()) AND (deletiondate > now()))
                                                  ->  Hash  (cost=1.14..1.14 rows=10 width=4)
                                                        ->  Seq Scan on subscriptionstatus  (cost=0.00..1.14 rows=10 width=4)
                                                              Filter: activestate
                                            ->  Parallel Hash  (cost=11245.39..11245.39 rows=272239 width=8)
                                                  ->  Parallel Seq Scan on subscription  (cost=0.00..11245.39 rows=272239 width=8)
        ->  Index Scan using idx_lnk_subscriber_msisdn_subscriber_deletiondate on lnk_subscriber_msisdn  (cost=0.44..1.87 rows=1 width=8)
              Index Cond: ((subscriber_id = subscriber.id) AND (deletiondate > now()))
              Filter: (activationdate <= now())
  ->  Index Scan using pkey_msisdn on msisdn  (cost=0.43..0.53 rows=1 width=20)
        Index Cond: (id = lnk_subscriber_msisdn.msisdn_id)

For now our plan is to wait for the next 18.x stable release, then
re-enable enable_parallel_hash and see whether it occurs again.

As a side note, my colleagues observed that the problematic query runs
drastically faster with enable_parallel_hash = off. That is on our list
for future exploration.

Best,
Chris
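(The enable_parallel_hash comparison mentioned above can be reproduced per session, with no server restart or postgresql.conf change — a generic sketch, with the actual statement elided:)

```sql
-- Session-local planner toggle; affects only the current connection.
SET enable_parallel_hash = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- the problematic query

RESET enable_parallel_hash;              -- back to the server default
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- rerun and compare timings
```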