Thread: "double free" segfault back in pyscopg2 2.5

"double free" segfault back in pyscopg2 2.5

From

"Gangadharan S.A."

Date:

20 June 2013, 14:04:34

Hi,

It looks like the "double free" segfault from pyscopg2 2.0.8 ( http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 ) is back in version 2.5:

** glibc detected *** httpd: double free or corruption (fasttop): 0x00007fb15de14180 ***

#0 0x0000003a8d232a45 in raise () from /lib64/libc.so.6
#1 0x0000003a8d234225 in abort () from /lib64/libc.so.6
#2 0x0000003a8d26fdfb in __libc_message () from /lib64/libc.so.6
#3 0x0000003a8d275716 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007fee8228c3f4 in connection_dealloc (obj=0x7fee58055da0) at psycopg/connection_type.c:1141
#5 0x00007fef12357e26 in subtype_dealloc (self=0x7fee58055da0) at Objects/typeobject.c:1014
#6 0x00007fef1233786b in dict_dealloc (mp=0x7fee58052e10) at Objects/dictobject.c:985
#7 0x00007fef12357e6c in subtype_dealloc (self=0x7fee880aed90) at Objects/typeobject.c:999
#8 0x00007fef1233786b in dict_dealloc (mp=0x7fee5803a770) at Objects/dictobject.c:985
#9 0x00007fef123d7367 in frame_dealloc (f=0x7fee580477c0) at Objects/frameobject.c:469
#10 0x00007fef1237d0fe in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4109
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007fef1237d0d6 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107

As before, the problem seems to be that when de-allocing the connection, we are calling conn_close before untracking the object. conn_close allows other threads to run and call the garbage collector, which ends up running dealloc a second time on this object. So we free the same memory a second time and hence the double free error.

The fix would be to call conn_close after untracking the object in connection_type.c:connection_dealloc().

The script used to reproduce the issue back in http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 will not reproduce the issue any more because we don't rollback the in-progress transaction during connection close anymore. The only way I reliably could reproduce the issue in a test script was to introduce a sleep in the C code at connection_int.c:conn_close() after Py_BEGIN_ALLOW_THREADS and then run garbage collector from another thread during the sleep.

Thanks,
Gangadharan

Re: "double free" segfault back in pyscopg2 2.5

From

Daniele Varrazzo

Date:

20 June 2013, 15:22:26

On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote:
> Hi,
>
>
> It looks like the "double free" segfault from pyscopg2 2.0.8
> ( http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 )
> is back in version 2.5:
[...]
> As before, the problem seems to be that when de-allocing the
> connection, we are calling conn_close before untracking the object.
> conn_close allows other threads to run and call the garbage collector,
> which ends up running dealloc a second time on this object. So we free
> the same memory a second time and hence the double free error.
>
>
> The fix would be to call conn_close after untracking the object in
> connection_type.c:connection_dealloc().

I see: sorry, I've broken it in commit 5aafe38f. Looks like in that
commit I've fixed the cursor and broken the connection.


> The script used to reproduce the issue back in
> http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 will
> not reproduce the issue any more because we don't rollback the
> in-progress transaction during connection close anymore. The only way
> I reliably could reproduce the issue in a test script was to introduce
> a sleep in the C code at connection_int.c:conn_close() after
> Py_BEGIN_ALLOW_THREADS and then run garbage collector from another
> thread during the sleep.
>
>
> Thanks,
> Gangadharan

Thank *you* very much. Open ticket #166. I had in mind to release a new
version in the next days: will include this correction too.


--
Daniele

Re: "double free" segfault back in pyscopg2 2.5

From

Daniele Varrazzo

Date:

20 June 2013, 15:42:36

On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote:

> The fix would be to call conn_close after untracking the object in
> connection_type.c:connection_dealloc().

Pushed correction at
https://github.com/dvarrazzo/psycopg/commit/889b1d826e9cf0ecbc9af938bad72937592af5e4

Is this enough to consider the issue fixed?

I've checked other destructors and we untrack the object before any
other operation in all the objects.

Thanks

--
Daniele

Re: "double free" segfault back in pyscopg2 2.5

From

"Gangadharan S.A."

Date:

20 June 2013, 15:47:14

Thanks! Yes, this change should fix the issue - and it did in my testing.

Do we know of a rough esitmate on when this will be released as a version?


On 20-Jun-2013, at 9:12 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote:
>
>> The fix would be to call conn_close after untracking the object in
>> connection_type.c:connection_dealloc().
>
> Pushed correction at
> https://github.com/dvarrazzo/psycopg/commit/889b1d826e9cf0ecbc9af938bad72937592af5e4
>
> Is this enough to consider the issue fixed?
>
> I've checked other destructors and we untrack the object before any
> other operation in all the objects.
>
> Thanks
>
>
> --
> Daniele
>
>
>
> --
> Sent via psycopg mailing list (psycopg@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/psycopg

Re: "double free" segfault back in pyscopg2 2.5

From

Daniele Varrazzo

Date:

20 June 2013, 15:54:56

On Thu, 2013-06-20 at 21:17 +0530, Gangadharan S.A. wrote:
> Thanks! Yes, this change should fix the issue - and it did in my testing.
>
> Do we know of a rough esitmate on when this will be released as a version?

I'll make a release across the weekend. You should have a 2.5.1 by
Monday.

Thank you again for fixing the same issue twice at 4 years distance. I'm
very curious about your setup: if you could spend two words on it we may
put together some stressing rig.

Cheers,

--
Daniele

Re: "double free" segfault back in pyscopg2 2.5

From

"Gangadharan S.A."

Date:

20 June 2013, 16:48:53

I'll make a release across the weekend. You should have a 2.5.1 by Monday.

Nice! Thanks.

Thank you again for fixing the same issue twice at 4 years distance. I'm
very curious about your setup: if you could spend two words on it we may
put together some stressing rig.

Glad to help. I think there is a combination of factors in our usage pattern of psycopg that increases the likelihood these kinds of issues surfacing:

1. We subclass the connection class and pass it in as connection_factory argument to psycopg2.connect() . Subclassing is necessary to make the garbage collector worry about this object having cyclic references and try tp_clear on it. This is how we end up in a place where the reference counting based destruction of the object on one thread coincides with the garbage collector trying to free the same object from another thread. The subclass definition of the connection class can simply be a "pass" statement and need not implement any actual functionality.

2. We have 100s of threads connecting to 100s of databases rationed out by a connection pool manager library. Connections are closed and opened frequently.

Please let me know if further information is needed.

Thanks,
Gangadharan

On Thu, Jun 20, 2013 at 9:24 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

On Thu, 2013-06-20 at 21:17 +0530, Gangadharan S.A. wrote:
> Thanks! Yes, this change should fix the issue - and it did in my testing.
>
> Do we know of a rough esitmate on when this will be released as a version?

I'll make a release across the weekend. You should have a 2.5.1 by
Monday.

Thank you again for fixing the same issue twice at 4 years distance. I'm
very curious about your setup: if you could spend two words on it we may
put together some stressing rig.

Cheers,

--
Daniele

Re: "double free" segfault back in pyscopg2 2.5

From

Daniele Varrazzo

Date:

20 June 2013, 17:48:15

On Thu, 2013-06-20 at 22:18 +0530, Gangadharan S.A. wrote:

> 1. We subclass the connection class and pass it in as
> connection_factory argument to psycopg2.connect() . Subclassing is
> necessary to make the garbage collector worry about this object having
> cyclic references and try tp_clear on it. This is how we end up in a
> place where the reference counting based destruction of the object on
> one thread coincides with the garbage collector trying to free the
> same object from another thread. The subclass definition of the
> connection class can simply be a "pass" statement and need not
> implement any actual functionality.

Do you mean a plain Python subclass or a C type subclass? A failed
attempt at creating connection/cursor C subclasses some times ago led me
to review the objects GC interaction (and re-introduce the bug).


> 2. We have 100s of  threads connecting to 100s of databases rationed
> out by a connection pool manager library. Connections are closed and
> opened frequently.

I guessed that was the minimum :)


> Please let me know if further information is needed.

How long have you been running 2.5 in production? Any other issue to
report that may be fixed for 2.5.1?


--
Daniele

Re: "double free" segfault back in pyscopg2 2.5

From

"Gangadharan S.A."

Date:

20 June 2013, 18:00:17

> Do you mean a plain Python subclass or a C type subclass?

Plain python subclass. And as I was saying, simply having a subclass
would be enough. It need not have any additional functionailty. It can
just have a "pass" statement in the subclass body. This is
demonstrated in the test script associated with the first segfault
mail thread from 4 years ago.

> How long have you been running 2.5 in production?

Nearly 2 weeks in production. The issue would show up only at
production scale and not in our test environments. Failure would have
happen roughly a couple of times per day per daemon process.

> Any other issue to
> report that may be fixed for 2.5.1?

Nothing else so far.

On 20-Jun-2013, at 11:18 PM, Daniele Varrazzo
<daniele.varrazzo@gmail.com> wrote:

> On Thu, 2013-06-20 at 22:18 +0530, Gangadharan S.A. wrote:
>
>> 1. We subclass the connection class and pass it in as
>> connection_factory argument to psycopg2.connect() . Subclassing is
>> necessary to make the garbage collector worry about this object having
>> cyclic references and try tp_clear on it. This is how we end up in a
>> place where the reference counting based destruction of the object on
>> one thread coincides with the garbage collector trying to free the
>> same object from another thread. The subclass definition of the
>> connection class can simply be a "pass" statement and need not
>> implement any actual functionality.
>
> Do you mean a plain Python subclass or a C type subclass? A failed
> attempt at creating connection/cursor C subclasses some times ago led me
> to review the objects GC interaction (and re-introduce the bug).
>
>
>> 2. We have 100s of  threads connecting to 100s of databases rationed
>> out by a connection pool manager library. Connections are closed and
>> opened frequently.
>
> I guessed that was the minimum :)
>
>
>> Please let me know if further information is needed.
>
>
>
>
> --
> Daniele
>

Re: "double free" segfault back in pyscopg2 2.5

From

"Gangadharan S.A."

Date:

20 June 2013, 19:47:02

Thanks! Yes, this change should fix the issue - and it did in my testing.

Do we know of a rough esitmate on when this will be released as a version?


On 20-Jun-2013, at 9:12 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote:
>
>> The fix would be to call conn_close after untracking the object in
>> connection_type.c:connection_dealloc().
>
> Pushed correction at
> https://github.com/dvarrazzo/psycopg/commit/889b1d826e9cf0ecbc9af938bad72937592af5e4
>
> Is this enough to consider the issue fixed?
>
> I've checked other destructors and we untrack the object before any
> other operation in all the objects.
>
> Thanks
>
>
> --
> Daniele
>
>
>
> --
> Sent via psycopg mailing list (psycopg@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/psycopg

Re: "double free" segfault back in pyscopg2 2.5

From

"Gangadharan S.A."

Date:

20 June 2013, 21:59:53

> Do you mean a plain Python subclass or a C type subclass?

Plain python subclass. And as I was saying, simply having a subclass
would be enough. It need not have any additional functionailty. It can
just have a "pass" statement in the subclass body. This is
demonstrated in the test script associated with the first segfault
mail thread from 4 years ago.

> How long have you been running 2.5 in production?

Nearly 2 weeks in production. The issue would show up only at
production scale and not in our test environments. Failure would have
happen roughly a couple of times per day per daemon process.

> Any other issue to
> report that may be fixed for 2.5.1?

Nothing else so far.

On 20-Jun-2013, at 11:18 PM, Daniele Varrazzo
<daniele.varrazzo@gmail.com> wrote:

> On Thu, 2013-06-20 at 22:18 +0530, Gangadharan S.A. wrote:
>
>> 1. We subclass the connection class and pass it in as
>> connection_factory argument to psycopg2.connect() . Subclassing is
>> necessary to make the garbage collector worry about this object having
>> cyclic references and try tp_clear on it. This is how we end up in a
>> place where the reference counting based destruction of the object on
>> one thread coincides with the garbage collector trying to free the
>> same object from another thread. The subclass definition of the
>> connection class can simply be a "pass" statement and need not
>> implement any actual functionality.
>
> Do you mean a plain Python subclass or a C type subclass? A failed
> attempt at creating connection/cursor C subclasses some times ago led me
> to review the objects GC interaction (and re-introduce the bug).
>
>
>> 2. We have 100s of  threads connecting to 100s of databases rationed
>> out by a connection pool manager library. Connections are closed and
>> opened frequently.
>
> I guessed that was the minimum :)
>
>
>> Please let me know if further information is needed.
>
>
>
>
> --
> Daniele
>