Thread: Segfault on exclusion constraint violation
9.4 and master segfault if an insertion would need to wait for another transaction to finish because of an exclusion constraint. To reproduce:

Run these in session A:

    create extension btree_gist;
    create table foo (i int4, constraint i_exclude exclude using gist (i with =));
    begin;
    insert into foo values (1);

Leave the transaction open, and in session B:

    insert into foo values (1);

    LOG: server process (PID 3690) was terminated by signal 11: Segmentation fault
    DETAIL: Failed process was running: insert into foo values (1);
    LOG: terminating any other active server processes

gdb backtrace:

    #0 0x000000000078520d in XactLockTableWait (xid=705, rel=0x7f2f6e835728, ctid=0x7f7f7f7f7f7f7f8b, oper=XLTW_RecheckExclusionConstr) at lmgr.c:515
    #1 0x000000000064bd86 in check_exclusion_constraint (heap=0x7f2f6e835728, index=0x7f2f6e837620, indexInfo=0x22187c0, tupleid=0x2219514, values=0x7fffae880a10, isnull=0x7fffae8809f0 "", estate=0x2218228, newIndex=0 '\000', errorOK=0 '\000') at execUtils.c:1310
    #2 0x000000000064b9a9 in ExecInsertIndexTuples (slot=0x2218500, tupleid=0x2219514, estate=0x2218228) at execUtils.c:1126
    #3 0x000000000065f8c4 in ExecInsert (slot=0x2218500, planSlot=0x2218500, estate=0x2218228, canSetTag=1 '\001') at nodeModifyTable.c:274

This only happens with assertions enabled. The culprit is commit f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid' argument to XactLockTableWait. check_exclusion_constraint calls index_endscan() just before XactLockTableWait, but that frees the memory the ctid points to.

The fix for this particular instance is trivial: copy the ctid to a local variable before calling index_endscan. However, looking at the other XactLockTableWait() and MultiXactIdWait() calls, there are more questionable pointers being passed. Most point to heap tuples on disk pages, after releasing the lock on the page, although not the pin. The one in EvalPlanQualFetch releases the pin too.

I'll write up a patch to change those call sites to use local variables. Hopefully it's trivial enough to still include in 9.4.1, although time is really running out.

- Heikki
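
The copy-before-free fix described above can be sketched in isolation. This is a minimal, self-contained C illustration and not PostgreSQL source: TidSketch, ScanSketch, and every *_sketch function are hypothetical stand-ins (roughly for a TID, an index scan, index_endscan(), and XactLockTableWait()). It assumes that ending the scan frees the memory holding the TID. The 0x7F fill before free() is also an assumption, chosen because it would be consistent with the 0x7f7f... pointer value in the backtrace.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a tuple identifier (block number, line offset). */
typedef struct TidSketch {
    uint32_t block;
    uint16_t offset;
} TidSketch;

/* Hypothetical scan that owns the memory of the TID it last returned. */
typedef struct ScanSketch {
    TidSketch *current_tid;     /* valid only until scan_end_sketch() */
} ScanSketch;

ScanSketch *scan_begin_sketch(uint32_t block, uint16_t offset)
{
    ScanSketch *scan = malloc(sizeof(ScanSketch));

    if (scan == NULL)
        return NULL;
    scan->current_tid = malloc(sizeof(TidSketch));
    if (scan->current_tid == NULL) {
        free(scan);
        return NULL;
    }
    scan->current_tid->block = block;
    scan->current_tid->offset = offset;
    return scan;
}

/* Ends the scan; every pointer into scan-owned memory is dangling afterwards. */
void scan_end_sketch(ScanSketch *scan)
{
    assert(scan != NULL);
    /* Assumption: overwrite with 0x7F before freeing, as a debug aid. */
    memset(scan->current_tid, 0x7F, sizeof(TidSketch));
    free(scan->current_tid);
    free(scan);
}

/* Stand-in for the wait call: it only reads the TID it is handed. */
uint64_t wait_for_xact_sketch(const TidSketch *tid)
{
    return ((uint64_t) tid->block << 16) | tid->offset;
}

/* Safe ordering: copy the TID by value, end the scan, then pass the copy. */
uint64_t end_scan_then_wait_sketch(ScanSketch *scan)
{
    TidSketch tid_copy = *scan->current_tid;

    scan_end_sketch(scan);
    return wait_for_xact_sketch(&tid_copy);
}
```

Passing scan->current_tid itself to wait_for_xact_sketch() after scan_end_sketch() would be the use-after-free pattern the message describes; the local copy avoids it without changing the order of the two calls.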
On 02/02/2015 03:50 PM, Heikki Linnakangas wrote:
> This only happens with assertions enabled. The culprit is commit
> f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
> argument to XactLockTableWait. check_exclusion_constraint calls
> index_endscan() just before XactLockTableWait, but that frees the
> memory the ctid points to.
>
> The fix for this particular instance is trivial: copy the ctid to a
> local variable before calling index_endscan. However, looking at the
> other XactLockTableWait() and MultiXactIdWait() calls, there are more
> questionable pointers being passed. Most point to heap tuples on disk
> pages, after releasing the lock on the page, although not the pin. The
> one in EvalPlanQualFetch releases the pin too.
>
> I'll write up a patch to change those call sites to use local variables.
> Hopefully it's trivial enough to still include in 9.4.1, although time
> is really running out.

I'll commit the attached fix shortly, so please shout quickly if you see a problem with this.

Aside from the potential for segfaults with assertions, I think the calls passed an incorrect ctid anyway. For example:

> --- a/src/backend/executor/execUtils.c
> +++ b/src/backend/executor/execUtils.c
> @@ -1307,7 +1307,7 @@ retry:
>          if (TransactionIdIsValid(xwait))
>          {
>              index_endscan(index_scan);
> -            XactLockTableWait(xwait, heap, &tup->t_data->t_ctid,
> +            XactLockTableWait(xwait, heap, &tup->t_self,
>                                XLTW_RecheckExclusionConstr);
>              goto retry;
>          }

We don't really want to pass the heap tuple's ctid field. If the tuple was updated (again) in the same transaction, the one that's still in progress, that points to the *next* tuple in the chain, but the error message says "while checking exclusion constraint on tuple (%u,%u) in relation \"%s\"". We should be passing the TID of the tuple itself, not the ctid value in the tuple's header.

The attached patch fixes that too.

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> 9.4 and master segfault if an insertion would need to wait for another
> transaction to finish because of an exclusion constraint. To reproduce:
> ...
> This only happens with assertions enabled. The culprit is commit
> f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
> argument to XactLockTableWait. check_exclusion_constraint calls
> index_endscan() just before XactLockTableWait, but that frees the
> memory the ctid points to.

> I'll write up a patch to change those call sites to use local variables.
> Hopefully it's trivial enough to still include in 9.4.1, although time
> is really running out.

If the only known bad consequence requires assertions enabled, I think it would be more prudent to *not* try to fix this in haste.

regards, tom lane
On 02/02/2015 04:38 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> 9.4 and master segfault if an insertion would need to wait for another
>> transaction to finish because of an exclusion constraint. To reproduce:
>> ...
>> This only happens with assertions enabled. The culprit is commit
>> f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
>> argument to XactLockTableWait. check_exclusion_constraint calls
>> index_endscan() just before XactLockTableWait, but that frees the
>> memory the ctid points to.
>
>> I'll write up a patch to change those call sites to use local variables.
>> Hopefully it's trivial enough to still include in 9.4.1, although time
>> is really running out.
>
> If the only known bad consequence requires assertions enabled, I think
> it would be more prudent to *not* try to fix this in haste.

Ok, I'll wait until after the release.

- Heikki
We are seeing a similar segfault scenario in 9.4.1 without assertions enabled. We upgraded 4 of our production instances from 9.3.5 to 9.4.1 over the weekend.

We had staged on 9.4.0, and then 9.4.1 for over a month without seeing this error, but the staging work load is much lower than production.

    postgres=# show debug_assertions;
     debug_assertions
    ------------------
     off
    (1 row)

    postgres=# set debug_assertions = 'on' ;
    ERROR: assertion checking is not supported by this build
    postgres=#

We believe the error occurred when an insert query was running inside a transaction, and another query attempted to insert into the same table.

Postgres Log says :

    2015-03-16 02:43:25.762 PDT,,,5481,,5504121c.1569,3,,2015-03-14 03:49:00 PDT,,0,LOG,00000,"server process (PID 9471) was terminated by signal 11: Segmentation fault","Failed process was running: INSERT into ad_instances (adinstid, cid, kid, adrefid, termid, sync_match_type, status_code, start_date, sync_keyword)
    SELECT nextval('public.ad_instances_adinstid_seq'), cid, kw.kid, adg.adrefid, termid, ai_mtid, 'u', now(), sync_keyword
    FROM tmp_keywords_1100218810
    JOIN keywords kw using (keyword)
    JOIN adgroups adg using (adref,cid)
    WHERE keyword_type = 'n' AND cid = 1100218810
    GROUP BY cid, kw.kid, adg.adrefid, termid, ai_mtid, sync_keyword, tmp_keywords_1100218810.status_code",,,,,,,,""

Application log says:

    2015-03-16 02:43:25.763 PDT,"release","c10036",4349,"op02.lon5.efrontier.com:39404",5506a408.10fd,27,"idle in transaction",2015-03-16 02:36:08 PDT,44/292306,3267900389,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,"ad_status"

/var/log/messages says:

    Mar 16 02:40:35 user27 kernel: : postgres[9471]: segfault at 30 ip 000000000066148b sp 00007fffa9c5f5f0 error 4 in postgres[400000+54d000]

We have not been able to reproduce the error, but we are testing scenarios in our staging environment now.
On Wed, Mar 18, 2015 at 12:15 AM, Dennis Pozzi <dpozzi@adobe.com> wrote:
> We have not been able to reproduce the error, but we are testing scenarios in our staging environment now.

If this is the same problem, this will be fixed in the next minor release as the fix has been committed here after 9.4.1 was released:

    commit: 57fe246890ad51e166fb6a8da937e41c35d7a279
    author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
    date: Wed, 4 Feb 2015 16:00:34 +0200

    Fix reference-after-free when waiting for another xact due to constraint.

Now, if you are able to produce a self-contained test case that fails even on HEAD, well that's another story...

Regards,
--
Michael