Thread: [BUG] Re-entering malloc problem when use --enable-nls build postgresql
HI pg team:
I found that compiling postgresql with enable-nls may be introduce a problem
1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.
5. This is the deadlock process stack:
Attachment
"=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes: > I found that compiling postgresql with enable-nls may be introduce a problem > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language. > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function) > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database. > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to processdeadlock. I can't get excited about this. quickdie's attempt to report that it's killing the process is necessarily a "best effort" undertaking, because we cannot be sure that the process is in a good state. In this situation, it isn't. --enable-nls might make the odds of that a bit worse, but we could get such a failure regardless. There are not any better alternatives. We can't just set a flag in the signal handler and hope that control will someday reach a place that notices the flag. We could exit without attempting to report anything, but nobody would find that user-friendly. So we try to report, in the full understanding that sometimes it won't work. That's one reason why the postmaster has a timeout-and-then-hard-SIGKILL behavior. regards, tom lane
On 2018-05-08 01:32:33 -0400, Tom Lane wrote: > "=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes: > > I found that compiling postgresql with enable-nls may be introduce a problem > > > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language. > > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function) > > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database. > > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to processdeadlock. I don't think it's realistic to treat this is something we'll necessarily backpatch. Any fix is going to be too complicated. > I can't get excited about this. quickdie's attempt to report that it's > killing the process is necessarily a "best effort" undertaking, because > we cannot be sure that the process is in a good state. In this situation, > it isn't. --enable-nls might make the odds of that a bit worse, but we > could get such a failure regardless. As previously, I disagree with this. There's plenty ways to reach quickdie, several where we are quite sure about the state. -m immediate is a thing, and there's plenty situations where it's an entirely reasonable choice. > There are not any better alternatives. We can't just set a flag in the > signal handler and hope that control will someday reach a place that > notices the flag. We could exit without attempting to report anything, > but nobody would find that user-friendly. So we try to report, in the > full understanding that sometimes it won't work. It'd be fairly unproblematic to write an untranslated message out. There we can make sure to either only use plain syscalls or use memory from the preallocated context. I think it'd be ok to not to translate in that situation. We should also use _exit or the like when exiting quickly, calling exit handlers from a signal handlers is a bad bad idea. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-05-08 01:32:33 -0400, Tom Lane wrote: >> There are not any better alternatives. We can't just set a flag in the >> signal handler and hope that control will someday reach a place that >> notices the flag. We could exit without attempting to report anything, >> but nobody would find that user-friendly. So we try to report, in the >> full understanding that sometimes it won't work. > It'd be fairly unproblematic to write an untranslated message out. To stderr, maybe. Across an SSL-encrypted client connection? You're dreaming. regards, tom lane
> It'd be fairly unproblematic to write an untranslated message out. There
> we can make sure to either only use plain syscalls or use memory from
> the preallocated context. I think it'd be ok to not to translate in
> that situation.
> we can make sure to either only use plain syscalls or use memory from
> the preallocated context. I think it'd be ok to not to translate in
> that situation.
I agree with this solution when build postgres with enable-nls.
------------------ Original ------------------
From: "Andres Freund"<andres@anarazel.de>;
Date: Tue, May 8, 2018 01:57 PM
To: "Tom Lane"<tgl@sss.pgh.pa.us>;
Cc: "158306855"<anderson2013@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql
> "158306855" <anderson2013@qq.com> writes:
> > I found that compiling postgresql with enable-nls may be introduce a problem
>
> > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
> > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
> > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
> > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.
I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.
> I can't get excited about this. quickdie's attempt to report that it's
> killing the process is necessarily a "best effort" undertaking, because
> we cannot be sure that the process is in a good state. In this situation,
> it isn't. --enable-nls might make the odds of that a bit worse, but we
> could get such a failure regardless.
As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.
> There are not any better alternatives. We can't just set a flag in the
> signal handler and hope that control will someday reach a place that
> notices the flag. We could exit without attempting to report anything,
> but nobody would find that user-friendly. So we try to report, in the
> full understanding that sometimes it won't work.
It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.
We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.
Greetings,
Andres Freund
On 2018-05-08 02:07:08 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2018-05-08 01:32:33 -0400, Tom Lane wrote: > >> There are not any better alternatives. We can't just set a flag in the > >> signal handler and hope that control will someday reach a place that > >> notices the flag. We could exit without attempting to report anything, > >> but nobody would find that user-friendly. So we try to report, in the > >> full understanding that sometimes it won't work. > > > It'd be fairly unproblematic to write an untranslated message out. > > To stderr, maybe. Across an SSL-encrypted client connection? You're > dreaming. libpq invents an equivalent message when the server closes the connection anyway, IIRC. So that'd not necessarily be too bad. Greetings, Andres Freund
On 2018-05-08 01:36:31 -0700, Andres Freund wrote: > On 2018-05-08 02:07:08 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > On 2018-05-08 01:32:33 -0400, Tom Lane wrote: > > >> There are not any better alternatives. We can't just set a flag in the > > >> signal handler and hope that control will someday reach a place that > > >> notices the flag. We could exit without attempting to report anything, > > >> but nobody would find that user-friendly. So we try to report, in the > > >> full understanding that sometimes it won't work. > > > > > It'd be fairly unproblematic to write an untranslated message out. > > > > To stderr, maybe. Across an SSL-encrypted client connection? You're > > dreaming. > > libpq invents an equivalent message when the server closes the > connection anyway, IIRC. So that'd not necessarily be too bad. Oh, also: It looks like it'd actually be relatively easy to give openssl its own memory allocator + pool: Create a global 'openssl' memory context with preallocation, use CRYPTO_set_mem_functions() to make openssl allocations go through small wrapper functions around palloc/repalloc/pfree. It's still not entirely kosher to call into openssl from a signal handler because we could be inside openssl - but the window for that is a lot smaller than being inside *any* memory allocation. Greetings, Andres Freund
On Wed, May 9, 2018 at 7:51 AM, Andres Freund <andres@anarazel.de> wrote: > On 2018-05-08 01:36:31 -0700, Andres Freund wrote: >> On 2018-05-08 02:07:08 -0400, Tom Lane wrote: >> > To stderr, maybe. Across an SSL-encrypted client connection? You're >> > dreaming. >> >> libpq invents an equivalent message when the server closes the >> connection anyway, IIRC. So that'd not necessarily be too bad. > > Oh, also: It looks like it'd actually be relatively easy to give openssl > its own memory allocator + pool: > Create a global 'openssl' memory context with preallocation, use > CRYPTO_set_mem_functions() to make openssl allocations go through small > wrapper functions around palloc/repalloc/pfree. > > It's still not entirely kosher to call into openssl from a signal > handler because we could be inside openssl - but the window for that is > a lot smaller than being inside *any* memory allocation. Can't we use a more traditional signal handling style to defer execution here? 1. Set an in_OpenSSL flag whenever you're about to enter OpenSSL. 2. In the SIGQUIT handler, if you find that flag is set, then just set got_SIGQUIT and return. 3. After you leave the OpenSSL code, if you see got_SIGQUIT, then run the handler explicitly. -- Thomas Munro http://www.enterprisedb.com
On 2018-05-09 09:30:58 +1200, Thomas Munro wrote: > On Wed, May 9, 2018 at 7:51 AM, Andres Freund <andres@anarazel.de> wrote: > > On 2018-05-08 01:36:31 -0700, Andres Freund wrote: > >> On 2018-05-08 02:07:08 -0400, Tom Lane wrote: > >> > To stderr, maybe. Across an SSL-encrypted client connection? You're > >> > dreaming. > >> > >> libpq invents an equivalent message when the server closes the > >> connection anyway, IIRC. So that'd not necessarily be too bad. > > > > Oh, also: It looks like it'd actually be relatively easy to give openssl > > its own memory allocator + pool: > > Create a global 'openssl' memory context with preallocation, use > > CRYPTO_set_mem_functions() to make openssl allocations go through small > > wrapper functions around palloc/repalloc/pfree. > > > > It's still not entirely kosher to call into openssl from a signal > > handler because we could be inside openssl - but the window for that is > > a lot smaller than being inside *any* memory allocation. > > Can't we use a more traditional signal handling style to defer > execution here? 1. Set an in_OpenSSL flag whenever you're about to > enter OpenSSL. 2. In the SIGQUIT handler, if you find that flag is > set, then just set got_SIGQUIT and return. 3. After you leave the > OpenSSL code, if you see got_SIGQUIT, then run the handler explicitly. Well, the question is if that'd ever have us defer killing the process for longer. quickdie is intended to actually die quickly. If there's rekeying, a large COPY incoming, or something like that it could take a bit. Aren't there also still places where we do blocking network IO? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Well, the question is if that'd ever have us defer killing the process > for longer. quickdie is intended to actually die quickly. Yeah. Though now that we have the postmaster mechanism to wait-five- seconds-then-SIGKILL, maybe we could rethink that requirement? If we reimplemented SIGQUIT to work more like SIGTERM, it would surely be a lot safer. There would be cases where a stuck backend wouldn't respond and it'd eventually get SIGKILL'd, but in return we'd get rid of problems like this one. On balance I'm not sure whether that solution is more or less user- friendly than the current setup, but for sure it'd be a lot easier to reason about. regards, tom lane
On 2018-05-08 18:04:07 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Well, the question is if that'd ever have us defer killing the process > > for longer. quickdie is intended to actually die quickly. > > Yeah. Though now that we have the postmaster mechanism to wait-five- > seconds-then-SIGKILL, maybe we could rethink that requirement? If we > reimplemented SIGQUIT to work more like SIGTERM, it would surely be > a lot safer. There would be cases where a stuck backend wouldn't > respond and it'd eventually get SIGKILL'd, but in return we'd get rid > of problems like this one. Right now we intentionally do not accept interrupts in a couple places where we want to die quickly because we're making persistent changes. I don't think it'd be good to continue e.g. committing any longer than possible after one process segfaulted. One counter-argument to that is that the timing right now is far from synchronous either. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-05-08 18:04:07 -0400, Tom Lane wrote: >> Yeah. Though now that we have the postmaster mechanism to wait-five- >> seconds-then-SIGKILL, maybe we could rethink that requirement? If we >> reimplemented SIGQUIT to work more like SIGTERM, it would surely be >> a lot safer. There would be cases where a stuck backend wouldn't >> respond and it'd eventually get SIGKILL'd, but in return we'd get rid >> of problems like this one. > Right now we intentionally do not accept interrupts in a couple places > where we want to die quickly because we're making persistent changes. I > don't think it'd be good to continue e.g. committing any longer than > possible after one process segfaulted. Well, that's an implementation question. I was sort of envisioning that we could allow CHECK_FOR_INTERRUPTS to respond to a SIGQUIT interrupt even when regular interrupts are disabled, reasoning that if we're at a CHECK_FOR_INTERRUPTS at all, then we should be someplace where it's more or less safe to trigger the exit. If we had it like that, then for example the commit sequence could put a CHECK_FOR_INTERRUPTS at the last possible moment before committing, knowing that regular interrupts are held off --- but if there's a SIGQUIT pending we'd take it. regards, tom lane
> It'd be fairly unproblematic to write an untranslated message out. There
> we can make sure to either only use plain syscalls or use memory from
> the preallocated context. I think it'd be ok to not to translate in
> that situation.
> we can make sure to either only use plain syscalls or use memory from
> the preallocated context. I think it'd be ok to not to translate in
> that situation.
I wrote a patch to improve this problem.
zeng wenjing
------------------ Original ------------------
From: "Andres Freund"<andres@anarazel.de>;
Date: Tue, May 8, 2018 01:57 PM
To: "Tom Lane"<tgl@sss.pgh.pa.us>;
Cc: "158306855"<anderson2013@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql
> "158306855" <anderson2013@qq.com> writes:
> > I found that compiling postgresql with enable-nls may be introduce a problem
>
> > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
> > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext function)
> > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
> > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.
I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.
> I can't get excited about this. quickdie's attempt to report that it's
> killing the process is necessarily a "best effort" undertaking, because
> we cannot be sure that the process is in a good state. In this situation,
> it isn't. --enable-nls might make the odds of that a bit worse, but we
> could get such a failure regardless.
As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.
> There are not any better alternatives. We can't just set a flag in the
> signal handler and hope that control will someday reach a place that
> notices the flag. We could exit without attempting to report anything,
> but nobody would find that user-friendly. So we try to report, in the
> full understanding that sometimes it won't work.
It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context. I think it'd be ok to not to translate in
that situation.
We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.
Greetings,
Andres Freund
Attachment
On 2018-05-21 10:49:53 +0800, 158306855 wrote: > > It'd be fairly unproblematic to write an untranslated message out. There > > we can make sure to either only use plain syscalls or use memory from > > the preallocated context. I think it'd be ok to not to translate in > > that situation. > > > I wrote a patch to improve this problem. > > +extern int errmsg_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2); > +extern int errdetail_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2); > +extern int errhint_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2); > + > /* > * errcontext() is typically called in error context callback functions, not > * within an ereport() invocation. The callback function can be in a different Can't we just reuse errmsg_internal etc? Greetings, Andres Freund
> Can't we just reuse errmsg_internal etc?
Yes, the *_internal function can be reused. This is my modified patch.
Please take a look.
Zeng Wenjing
------------------ Original ------------------
From: "Andres Freund"<andres@anarazel.de>;
Date: Wed, May 23, 2018 05:11 AM
To: "158306855"<anderson2013@qq.com>;
Cc: "Tom Lane"<tgl@sss.pgh.pa.us>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: [BUG] Re-entering malloc problem when use --enable-nlsbuildpostgresql
> > It'd be fairly unproblematic to write an untranslated message out. There
> > we can make sure to either only use plain syscalls or use memory from
> > the preallocated context. I think it'd be ok to not to translate in
> > that situation.
>
>
> I wrote a patch to improve this problem.
>
> +extern int errmsg_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern int errdetail_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern int errhint_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +
> /*
> * errcontext() is typically called in error context callback functions, not
> * within an ereport() invocation. The callback function can be in a different
Can't we just reuse errmsg_internal etc?
Greetings,
Andres Freund