Thread: "double free" segfault back in pyscopg2 2.5
Hi,
It looks like the "double free" segfault from pyscopg2 2.0.8 ( http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 ) is back in version 2.5:** glibc detected *** httpd: double free or corruption (fasttop): 0x00007fb15de14180 ***
#0 0x0000003a8d232a45 in raise () from /lib64/libc.so.6
#1 0x0000003a8d234225 in abort () from /lib64/libc.so.6
#2 0x0000003a8d26fdfb in __libc_message () from /lib64/libc.so.6
#3 0x0000003a8d275716 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007fee8228c3f4 in connection_dealloc (obj=0x7fee58055da0) at psycopg/connection_type.c:1141
#5 0x00007fef12357e26 in subtype_dealloc (self=0x7fee58055da0) at Objects/typeobject.c:1014
#6 0x00007fef1233786b in dict_dealloc (mp=0x7fee58052e10) at Objects/dictobject.c:985
#7 0x00007fef12357e6c in subtype_dealloc (self=0x7fee880aed90) at Objects/typeobject.c:999
#8 0x00007fef1233786b in dict_dealloc (mp=0x7fee5803a770) at Objects/dictobject.c:985
#9 0x00007fef123d7367 in frame_dealloc (f=0x7fee580477c0) at Objects/frameobject.c:469
#10 0x00007fef1237d0fe in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4109
#11 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#12 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#13 0x00007fef1237d0d6 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
As before, the problem seems to be that when de-allocing the connection, we are calling conn_close before untracking the object. conn_close allows other threads to run and call the garbage collector, which ends up running dealloc a second time on this object. So we free the same memory a second time and hence the double free error.
Gangadharan
On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote: > Hi, > > > It looks like the "double free" segfault from pyscopg2 2.0.8 > ( http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 ) > is back in version 2.5: [...] > As before, the problem seems to be that when de-allocing the > connection, we are calling conn_close before untracking the object. > conn_close allows other threads to run and call the garbage collector, > which ends up running dealloc a second time on this object. So we free > the same memory a second time and hence the double free error. > > > The fix would be to call conn_close after untracking the object in > connection_type.c:connection_dealloc(). I see: sorry, I've broken it in commit 5aafe38f. Looks like in that commit I've fixed the cursor and broken the connection. > The script used to reproduce the issue back in > http://comments.gmane.org/gmane.comp.python.db.psycopg.devel/4964 will > not reproduce the issue any more because we don't rollback the > in-progress transaction during connection close anymore. The only way > I reliably could reproduce the issue in a test script was to introduce > a sleep in the C code at connection_int.c:conn_close() after > Py_BEGIN_ALLOW_THREADS and then run garbage collector from another > thread during the sleep. > > > Thanks, > Gangadharan Thank *you* very much. Open ticket #166. I had in mind to release a new version in the next days: will include this correction too. -- Daniele
On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote: > The fix would be to call conn_close after untracking the object in > connection_type.c:connection_dealloc(). Pushed correction at https://github.com/dvarrazzo/psycopg/commit/889b1d826e9cf0ecbc9af938bad72937592af5e4 Is this enough to consider the issue fixed? I've checked other destructors and we untrack the object before any other operation in all the objects. Thanks -- Daniele
Thanks! Yes, this change should fix the issue - and it did in my testing. Do we know of a rough esitmate on when this will be released as a version? On 20-Jun-2013, at 9:12 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote: > On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote: > >> The fix would be to call conn_close after untracking the object in >> connection_type.c:connection_dealloc(). > > Pushed correction at > https://github.com/dvarrazzo/psycopg/commit/889b1d826e9cf0ecbc9af938bad72937592af5e4 > > Is this enough to consider the issue fixed? > > I've checked other destructors and we untrack the object before any > other operation in all the objects. > > Thanks > > > -- > Daniele > > > > -- > Sent via psycopg mailing list (psycopg@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/psycopg
On Thu, 2013-06-20 at 21:17 +0530, Gangadharan S.A. wrote: > Thanks! Yes, this change should fix the issue - and it did in my testing. > > Do we know of a rough esitmate on when this will be released as a version? I'll make a release across the weekend. You should have a 2.5.1 by Monday. Thank you again for fixing the same issue twice at 4 years distance. I'm very curious about your setup: if you could spend two words on it we may put together some stressing rig. Cheers, -- Daniele
I'll make a release across the weekend. You should have a 2.5.1 by Monday.
Nice! Thanks.
Thank you again for fixing the same issue twice at 4 years distance. I'm
very curious about your setup: if you could spend two words on it we may
put together some stressing rig.
Glad to help. I think there is a combination of factors in our usage pattern of psycopg that increases the likelihood these kinds of issues surfacing:
1. We subclass the connection class and pass it in as connection_factory argument to psycopg2.connect() . Subclassing is necessary to make the garbage collector worry about this object having cyclic references and try tp_clear on it. This is how we end up in a place where the reference counting based destruction of the object on one thread coincides with the garbage collector trying to free the same object from another thread. The subclass definition of the connection class can simply be a "pass" statement and need not implement any actual functionality.
2. We have 100s of threads connecting to 100s of databases rationed out by a connection pool manager library. Connections are closed and opened frequently.
Please let me know if further information is needed.
Thanks,
Gangadharan
Gangadharan
On Thu, Jun 20, 2013 at 9:24 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:
On Thu, 2013-06-20 at 21:17 +0530, Gangadharan S.A. wrote:I'll make a release across the weekend. You should have a 2.5.1 by
> Thanks! Yes, this change should fix the issue - and it did in my testing.
>
> Do we know of a rough esitmate on when this will be released as a version?
Monday.
Thank you again for fixing the same issue twice at 4 years distance. I'm
very curious about your setup: if you could spend two words on it we may
put together some stressing rig.
Cheers,
--
Daniele
On Thu, 2013-06-20 at 22:18 +0530, Gangadharan S.A. wrote: > 1. We subclass the connection class and pass it in as > connection_factory argument to psycopg2.connect() . Subclassing is > necessary to make the garbage collector worry about this object having > cyclic references and try tp_clear on it. This is how we end up in a > place where the reference counting based destruction of the object on > one thread coincides with the garbage collector trying to free the > same object from another thread. The subclass definition of the > connection class can simply be a "pass" statement and need not > implement any actual functionality. Do you mean a plain Python subclass or a C type subclass? A failed attempt at creating connection/cursor C subclasses some times ago led me to review the objects GC interaction (and re-introduce the bug). > 2. We have 100s of threads connecting to 100s of databases rationed > out by a connection pool manager library. Connections are closed and > opened frequently. I guessed that was the minimum :) > Please let me know if further information is needed. How long have you been running 2.5 in production? Any other issue to report that may be fixed for 2.5.1? -- Daniele
> Do you mean a plain Python subclass or a C type subclass? Plain python subclass. And as I was saying, simply having a subclass would be enough. It need not have any additional functionailty. It can just have a "pass" statement in the subclass body. This is demonstrated in the test script associated with the first segfault mail thread from 4 years ago. > How long have you been running 2.5 in production? Nearly 2 weeks in production. The issue would show up only at production scale and not in our test environments. Failure would have happen roughly a couple of times per day per daemon process. > Any other issue to > report that may be fixed for 2.5.1? Nothing else so far. On 20-Jun-2013, at 11:18 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote: > On Thu, 2013-06-20 at 22:18 +0530, Gangadharan S.A. wrote: > >> 1. We subclass the connection class and pass it in as >> connection_factory argument to psycopg2.connect() . Subclassing is >> necessary to make the garbage collector worry about this object having >> cyclic references and try tp_clear on it. This is how we end up in a >> place where the reference counting based destruction of the object on >> one thread coincides with the garbage collector trying to free the >> same object from another thread. The subclass definition of the >> connection class can simply be a "pass" statement and need not >> implement any actual functionality. > > Do you mean a plain Python subclass or a C type subclass? A failed > attempt at creating connection/cursor C subclasses some times ago led me > to review the objects GC interaction (and re-introduce the bug). > > >> 2. We have 100s of threads connecting to 100s of databases rationed >> out by a connection pool manager library. Connections are closed and >> opened frequently. > > I guessed that was the minimum :) > > >> Please let me know if further information is needed. > > > > > -- > Daniele >
Thanks! Yes, this change should fix the issue - and it did in my testing. Do we know of a rough esitmate on when this will be released as a version? On 20-Jun-2013, at 9:12 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote: > On Thu, 2013-06-20 at 19:34 +0530, Gangadharan S.A. wrote: > >> The fix would be to call conn_close after untracking the object in >> connection_type.c:connection_dealloc(). > > Pushed correction at > https://github.com/dvarrazzo/psycopg/commit/889b1d826e9cf0ecbc9af938bad72937592af5e4 > > Is this enough to consider the issue fixed? > > I've checked other destructors and we untrack the object before any > other operation in all the objects. > > Thanks > > > -- > Daniele > > > > -- > Sent via psycopg mailing list (psycopg@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/psycopg
> Do you mean a plain Python subclass or a C type subclass? Plain python subclass. And as I was saying, simply having a subclass would be enough. It need not have any additional functionailty. It can just have a "pass" statement in the subclass body. This is demonstrated in the test script associated with the first segfault mail thread from 4 years ago. > How long have you been running 2.5 in production? Nearly 2 weeks in production. The issue would show up only at production scale and not in our test environments. Failure would have happen roughly a couple of times per day per daemon process. > Any other issue to > report that may be fixed for 2.5.1? Nothing else so far. On 20-Jun-2013, at 11:18 PM, Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote: > On Thu, 2013-06-20 at 22:18 +0530, Gangadharan S.A. wrote: > >> 1. We subclass the connection class and pass it in as >> connection_factory argument to psycopg2.connect() . Subclassing is >> necessary to make the garbage collector worry about this object having >> cyclic references and try tp_clear on it. This is how we end up in a >> place where the reference counting based destruction of the object on >> one thread coincides with the garbage collector trying to free the >> same object from another thread. The subclass definition of the >> connection class can simply be a "pass" statement and need not >> implement any actual functionality. > > Do you mean a plain Python subclass or a C type subclass? A failed > attempt at creating connection/cursor C subclasses some times ago led me > to review the objects GC interaction (and re-introduce the bug). > > >> 2. We have 100s of threads connecting to 100s of databases rationed >> out by a connection pool manager library. Connections are closed and >> opened frequently. > > I guessed that was the minimum :) > > >> Please let me know if further information is needed. > > > > > -- > Daniele >