Thread: Segfault on exclusion constraint violation

Segfault on exclusion constraint violation

From
Heikki Linnakangas
Date:
9.4 and master segfault if an insertion would need to wait for another
transaction to finish because of an exclusion constraint. To reproduce:

Run these in session A:

create extension btree_gist;
create table foo (i int4, constraint i_exclude exclude using gist (i with =));
begin; insert into foo values (1);

Leave the transaction open, and run this in session B:

insert into foo values (1);


LOG:  server process (PID 3690) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: insert into foo values (1);
LOG:  terminating any other active server processes

gdb backtrace:

#0  0x000000000078520d in XactLockTableWait (xid=705, rel=0x7f2f6e835728,
     ctid=0x7f7f7f7f7f7f7f8b, oper=XLTW_RecheckExclusionConstr) at lmgr.c:515
#1  0x000000000064bd86 in check_exclusion_constraint (heap=0x7f2f6e835728,
     index=0x7f2f6e837620, indexInfo=0x22187c0, tupleid=0x2219514,
     values=0x7fffae880a10, isnull=0x7fffae8809f0 "", estate=0x2218228,
     newIndex=0 '\000', errorOK=0 '\000') at execUtils.c:1310
#2  0x000000000064b9a9 in ExecInsertIndexTuples (slot=0x2218500,
     tupleid=0x2219514, estate=0x2218228) at execUtils.c:1126
#3  0x000000000065f8c4 in ExecInsert (slot=0x2218500, planSlot=0x2218500,
     estate=0x2218228, canSetTag=1 '\001') at nodeModifyTable.c:274


This only happens with assertions enabled. The culprit is commit
f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
argument to XactLockTableWait. check_exclusion_constraint calls
index_endscan() just before XactLockTableWait, but that frees the
memory the ctid points to.

The fix for this particular instance is trivial: copy the ctid to a
local variable before calling index_endscan. However, looking at the
other XactLockTableWait() and MultiXactIdWait() calls, there are more
questionable pointers being passed. Most point to heap tuples on disk
pages, after releasing the lock on the page, although not the pin. The
one in EvalPlanQualFetch releases the pin too.
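
A minimal standalone sketch of this failure mode and of the local-copy fix, in plain C rather than PostgreSQL code. The types Tid, TupleHeader, Tuple and Scan and all function names below are hypothetical stand-ins. Two details are assumptions, not statements from this thread: that assert-enabled builds overwrite freed memory with 0x7F bytes, and that the ctid field sits 12 bytes into the tuple header. Under those assumptions, a stale &tup->t_data->t_ctid on a 64-bit build evaluates to an address of the form 0x7f...7f8b, which matches the ctid argument in the backtrace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for ItemPointerData, HeapTupleHeaderData,
 * HeapTupleData and the index scan descriptor. */
typedef struct { uint16_t bi_hi, bi_lo, offset; } Tid;

typedef struct {
    char visibility[12];  /* assumed: 12 bytes precede the ctid field */
    Tid  ctid;            /* link to the next tuple version */
} TupleHeader;

typedef struct {
    Tid          self;    /* TID of the tuple itself */
    TupleHeader *data;    /* pointer into storage owned by the scan */
} Tuple;

typedef struct { Tuple tuple; } Scan;

/* Simulates ending a scan in an assert-enabled build: the scan's memory is
 * overwritten with 0x7F bytes (assumed wipe pattern). The sketch only wipes
 * and never calls free(), so inspecting the bytes afterwards is well defined. */
static void end_scan(Scan *scan)
{
    memset(scan, 0x7F, sizeof *scan);
}

/* Address the pre-fix call would have passed: &tup->data->ctid, computed
 * from the wiped pointer bytes without dereferencing them. */
static uintptr_t stale_ctid_address(const Scan *scan)
{
    uintptr_t base;
    memcpy(&base, &scan->tuple.data, sizeof base);
    return base + offsetof(TupleHeader, ctid);
}

/* The fix pattern: copy the TID by value before the scan is torn down. */
static Tid end_scan_keeping_tid(Scan *scan)
{
    Tid copy = scan->tuple.self;
    end_scan(scan);
    return copy;
}
```

end_scan_keeping_tid() returns a TID that stays valid after the scan is gone, while stale_ctid_address() shows the garbage address the pre-fix call would have handed to the wait function.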

I'll write up a patch to change those call sites to use local variables.
Hopefully it's trivial enough to still include in 9.4.1, although time
is really running out..

- Heikki

Re: Segfault on exclusion constraint violation

From
Heikki Linnakangas
Date:
On 02/02/2015 03:50 PM, Heikki Linnakangas wrote:
> This only happens with assertions enabled. The culprit is commit
> f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
> argument to XactLockTableWait. check_exclusion_constraint calls
> index_endscan() just before XactLockTableWait, but that free's the
> memory the ctid points to.
>
> The fix for this particular instance is trivial: copy the ctid to a
> local variable before calling index_endscan. However, looking at the
> other XactLockTableWait() and MultiXactIdWait() calls, there are more
> questionable pointers being passed. Most point to heap tuples on disk
> pages, after releasing the lock on the page, although not the pin. The
> one in EvalPlanQualFetch releases the pin too.
>
> I'll write up a patch to change those call sites to use local variables.
> Hopefully it's trivial enough to still include in 9.4.1, although time
> is really running out..

I'll commit the attached fix shortly, so please shout quickly if you see
a problem with this.

Aside from the potential for segfaults with assertions, I think the
calls passed an incorrect ctid anyway. For example:

> --- a/src/backend/executor/execUtils.c
> +++ b/src/backend/executor/execUtils.c
> @@ -1307,7 +1307,7 @@ retry:
>                 if (TransactionIdIsValid(xwait))
>                 {
>                         index_endscan(index_scan);
> -                       XactLockTableWait(xwait, heap, &tup->t_data->t_ctid,
> +                       XactLockTableWait(xwait, heap, &tup->t_self,
>                                                           XLTW_RecheckExclusionConstr);
>                         goto retry;
>                 }

We don't really want to pass the heap tuple's ctid field. If the tuple
was updated (again) by the same transaction, the one that's still
in progress, its ctid points to the *next* tuple in the chain, but the error
message says "while checking exclusion constraint on tuple (%u,%u) in
relation \"%s\"". We should be passing the TID of the tuple itself, not
the ctid value in the tuple's header. The attached patch fixes that too.
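
To make the distinction concrete, here is a small standalone sketch, not PostgreSQL code. TupleVersion, update_version and format_wait_context are hypothetical names; only the message text is taken from the mail above. It assumes the usual heap behavior that a tuple which has not been updated carries its own TID in its ctid field.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical simplified TID: (block number, line pointer offset). */
typedef struct { uint32_t block; uint16_t offset; } Tid;

/* One tuple version: where it lives (self) and what its header's
 * ctid field points to. */
typedef struct { Tid self; Tid ctid; } TupleVersion;

/* Models an UPDATE: the new version is stored at new_tid and points to
 * itself, while the old version's ctid is redirected to the new version. */
static void update_version(TupleVersion *old_ver, Tid new_tid,
                           TupleVersion *new_ver)
{
    new_ver->self = new_tid;
    new_ver->ctid = new_tid;
    old_ver->ctid = new_tid;
}

/* Formats the wait context message for whichever TID the caller passes. */
static int format_wait_context(char *buf, size_t len, Tid tid, const char *rel)
{
    return snprintf(buf, len,
                    "while checking exclusion constraint on tuple (%u,%u) "
                    "in relation \"%s\"",
                    (unsigned) tid.block, (unsigned) tid.offset, rel);
}
```

After an update of the tuple at (0,1) to a new version at (0,2), formatting the old version's self reports (0,1), the conflicting tuple itself, whereas formatting its ctid reports (0,2), the next version in the chain.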

- Heikki

Attachment

Re: Segfault on exclusion constraint violation

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> 9.4 and master segfaults, if an insertion would need to wait for another
> transaction to finish because of an exclusion constraint. To reproduce:
> ...
> This only happens with assertions enabled. The culprit is commit
> f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
> argument to XactLockTableWait. check_exclusion_constraint calls
> index_endscan() just before XactLockTableWait, but that free's the
> memory the ctid points to.

> I'll write up a patch to change those call sites to use local variables.
> Hopefully it's trivial enough to still include in 9.4.1, although time
> is really running out..

If the only known bad consequence requires assertions enabled, I think
it would be more prudent to *not* try to fix this in haste.

            regards, tom lane

Re: Segfault on exclusion constraint violation

From
Heikki Linnakangas
Date:
On 02/02/2015 04:38 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> 9.4 and master segfaults, if an insertion would need to wait for another
>> transaction to finish because of an exclusion constraint. To reproduce:
>> ...
>> This only happens with assertions enabled. The culprit is commit
>> f88d4cfc9d417dac2ee41a8f5e593898e56fd2bd, which added the 'ctid'
>> argument to XactLockTableWait. check_exclusion_constraint calls
>> index_endscan() just before XactLockTableWait, but that free's the
>> memory the ctid points to.
>
>> I'll write up a patch to change those call sites to use local variables.
>> Hopefully it's trivial enough to still include in 9.4.1, although time
>> is really running out..
>
> If the only known bad consequence requires assertions enabled, I think
> it would be more prudent to *not* try to fix this in haste.

Ok, I'll wait until after the release.

- Heikki

Re: Segfault on exclusion constraint violation

From
Dennis Pozzi
Date:
We are seeing a similar segfault scenario in 9.4.1 without assertions enabled. We upgraded 4 of our production instances from 9.3.5 to 9.4.1 over the weekend.
We had staged on 9.4.0, and then 9.4.1 for over a month without seeing this error, but the staging work load is much lower than production.

postgres=# show debug_assertions;
 debug_assertions
------------------
 off
(1 row)

postgres=# set debug_assertions = 'on' ;
ERROR:  assertion checking is not supported by this build
postgres=#

We believe the error occurred when an insert query was running inside a transaction, and another query attempted to insert into the same table.

Postgres Log says :
2015-03-16 02:43:25.762 PDT,,,5481,,5504121c.1569,3,,2015-03-14 03:49:00 PDT,,0,LOG,00000,"server process (PID 9471) was terminated by signal 11: Segmentation fault","Failed process was running: INSERT into ad_instances (adinstid, cid, kid, adrefid, termid, sync_match_type,
                                               status_code, start_date, sync_keyword)
                     SELECT nextval('public.ad_instances_adinstid_seq'), cid, kw.kid, adg.adrefid,
                            termid, ai_mtid, 'u', now(), sync_keyword
                     FROM   tmp_keywords_1100218810
                     JOIN   keywords kw using (keyword)
                     JOIN   adgroups adg using (adref,cid)
                     WHERE  keyword_type = 'n'
                        AND cid = 1100218810
                     GROUP BY cid, kw.kid, adg.adrefid, termid, ai_mtid, sync_keyword, tmp_keywords_1100218810.status_code",,,,,,,,""

Application log says:
2015-03-16 02:43:25.763 PDT,"release","c10036",4349,"op02.lon5.efrontier.com:39404",5506a408.10fd,27,"idle in transaction",2015-03-16 02:36:08 PDT,44/292306,3267900389,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,"ad_status"

/var/log/messages says:
Mar 16 02:40:35 user27 kernel: : postgres[9471]: segfault at 30 ip 000000000066148b sp 00007fffa9c5f5f0 error 4 in postgres[400000+54d000]

We have not been able to reproduce the error, but we are testing scenarios in our staging environment now.

Re: Segfault on exclusion constraint violation

From
Michael Paquier
Date:
On Wed, Mar 18, 2015 at 12:15 AM, Dennis Pozzi <dpozzi@adobe.com> wrote:
> We have not been able to reproduce the error, but we are testing scenarios in our staging environment now.

If this is the same problem, this will be fixed in the next minor
release as the fix has been committed here after 9.4.1 was released:
commit: 57fe246890ad51e166fb6a8da937e41c35d7a279
author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Wed, 4 Feb 2015 16:00:34 +0200
Fix reference-after-free when waiting for another xact due to constraint.

Now, if you are able to produce a self-contained test case that fails
even on HEAD, well that's another story...
Regards,
--
Michael