Thread: [BUG] Re-entering malloc problem when use --enable-nls build postgresql

[BUG] Re-entering malloc problem when use --enable-nls build postgresql

From
"158306855"
Date:
HI pg team:

I found that compiling postgresql with enable-nls may be introduce a problem

1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext  function)
3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.
5. This is the deadlock process stack:

Attachment
"=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes:
> I found that compiling postgresql with enable-nls may be introduce a problem

> 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate
Language.
> 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext  function)
> 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
> 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to
processdeadlock. 

I can't get excited about this.  quickdie's attempt to report that it's
killing the process is necessarily a "best effort" undertaking, because
we cannot be sure that the process is in a good state.  In this situation,
it isn't.  --enable-nls might make the odds of that a bit worse, but we
could get such a failure regardless.

There are not any better alternatives.  We can't just set a flag in the
signal handler and hope that control will someday reach a place that
notices the flag.  We could exit without attempting to report anything,
but nobody would find that user-friendly.  So we try to report, in the
full understanding that sometimes it won't work.  That's one reason why
the postmaster has a timeout-and-then-hard-SIGKILL behavior.

            regards, tom lane


Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

From
Andres Freund
Date:
On 2018-05-08 01:32:33 -0400, Tom Lane wrote:
> "=?ISO-8859-1?B?MTU4MzA2ODU1?=" <anderson2013@qq.com> writes:
> > I found that compiling postgresql with enable-nls may be introduce a problem
> 
> > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate
Language.
> > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext  function)
> > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
> > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to
processdeadlock.
 

I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.


> I can't get excited about this.  quickdie's attempt to report that it's
> killing the process is necessarily a "best effort" undertaking, because
> we cannot be sure that the process is in a good state.  In this situation,
> it isn't.  --enable-nls might make the odds of that a bit worse, but we
> could get such a failure regardless.

As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.


> There are not any better alternatives.  We can't just set a flag in the
> signal handler and hope that control will someday reach a place that
> notices the flag.  We could exit without attempting to report anything,
> but nobody would find that user-friendly.  So we try to report, in the
> full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context.  I think it'd be ok to not to translate in
that situation.

We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-05-08 01:32:33 -0400, Tom Lane wrote:
>> There are not any better alternatives.  We can't just set a flag in the
>> signal handler and hope that control will someday reach a place that
>> notices the flag.  We could exit without attempting to report anything,
>> but nobody would find that user-friendly.  So we try to report, in the
>> full understanding that sometimes it won't work.

> It'd be fairly unproblematic to write an untranslated message out.

To stderr, maybe.  Across an SSL-encrypted client connection?  You're
dreaming.

            regards, tom lane


> It'd be fairly unproblematic to write an untranslated message out. There
> we can make sure to either only use plain syscalls or use memory from
> the preallocated context.  I think it'd be ok to not to translate in
> that situation.
I agree with this solution when build postgres with enable-nls.



------------------ Original ------------------
From:  "Andres Freund"<andres@anarazel.de>;
Date:  Tue, May 8, 2018 01:57 PM
To:  "Tom Lane"<tgl@sss.pgh.pa.us>;
Cc:  "158306855"<anderson2013@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject:  Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:
> "158306855" <anderson2013@qq.com> writes:
> > I found that compiling postgresql with enable-nls may be introduce a problem
>
> > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
> > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext  function)
> > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
> > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.

I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.


> I can't get excited about this.  quickdie's attempt to report that it's
> killing the process is necessarily a "best effort" undertaking, because
> we cannot be sure that the process is in a good state.  In this situation,
> it isn't.  --enable-nls might make the odds of that a bit worse, but we
> could get such a failure regardless.

As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.


> There are not any better alternatives.  We can't just set a flag in the
> signal handler and hope that control will someday reach a place that
> notices the flag.  We could exit without attempting to report anything,
> but nobody would find that user-friendly.  So we try to report, in the
> full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context.  I think it'd be ok to not to translate in
that situation.

We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.

Greetings,

Andres Freund

Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

From
Andres Freund
Date:
On 2018-05-08 02:07:08 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-05-08 01:32:33 -0400, Tom Lane wrote:
> >> There are not any better alternatives.  We can't just set a flag in the
> >> signal handler and hope that control will someday reach a place that
> >> notices the flag.  We could exit without attempting to report anything,
> >> but nobody would find that user-friendly.  So we try to report, in the
> >> full understanding that sometimes it won't work.
> 
> > It'd be fairly unproblematic to write an untranslated message out.
> 
> To stderr, maybe.  Across an SSL-encrypted client connection?  You're
> dreaming.

libpq invents an equivalent message when the server closes the
connection anyway, IIRC. So that'd not necessarily be too bad.

Greetings,

Andres Freund


Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

From
Andres Freund
Date:
On 2018-05-08 01:36:31 -0700, Andres Freund wrote:
> On 2018-05-08 02:07:08 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2018-05-08 01:32:33 -0400, Tom Lane wrote:
> > >> There are not any better alternatives.  We can't just set a flag in the
> > >> signal handler and hope that control will someday reach a place that
> > >> notices the flag.  We could exit without attempting to report anything,
> > >> but nobody would find that user-friendly.  So we try to report, in the
> > >> full understanding that sometimes it won't work.
> > 
> > > It'd be fairly unproblematic to write an untranslated message out.
> > 
> > To stderr, maybe.  Across an SSL-encrypted client connection?  You're
> > dreaming.
> 
> libpq invents an equivalent message when the server closes the
> connection anyway, IIRC. So that'd not necessarily be too bad.

Oh, also: It looks like it'd actually be relatively easy to give openssl
its own memory allocator + pool:
Create a global 'openssl' memory context with preallocation, use
CRYPTO_set_mem_functions() to make openssl allocations go through small
wrapper functions around palloc/repalloc/pfree.

It's still not entirely kosher to call into openssl from a signal
handler because we could be inside openssl - but the window for that is
a lot smaller than being inside *any* memory allocation.

Greetings,

Andres Freund


On Wed, May 9, 2018 at 7:51 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-05-08 01:36:31 -0700, Andres Freund wrote:
>> On 2018-05-08 02:07:08 -0400, Tom Lane wrote:
>> > To stderr, maybe.  Across an SSL-encrypted client connection?  You're
>> > dreaming.
>>
>> libpq invents an equivalent message when the server closes the
>> connection anyway, IIRC. So that'd not necessarily be too bad.
>
> Oh, also: It looks like it'd actually be relatively easy to give openssl
> its own memory allocator + pool:
> Create a global 'openssl' memory context with preallocation, use
> CRYPTO_set_mem_functions() to make openssl allocations go through small
> wrapper functions around palloc/repalloc/pfree.
>
> It's still not entirely kosher to call into openssl from a signal
> handler because we could be inside openssl - but the window for that is
> a lot smaller than being inside *any* memory allocation.

Can't we use a more traditional signal handling style to defer
execution here?  1. Set an in_OpenSSL flag whenever you're about to
enter OpenSSL.  2.  In the SIGQUIT handler, if you find that flag is
set, then just set got_SIGQUIT and return.  3.  After you leave the
OpenSSL code, if you see got_SIGQUIT, then run the handler explicitly.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

From
Andres Freund
Date:
On 2018-05-09 09:30:58 +1200, Thomas Munro wrote:
> On Wed, May 9, 2018 at 7:51 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-05-08 01:36:31 -0700, Andres Freund wrote:
> >> On 2018-05-08 02:07:08 -0400, Tom Lane wrote:
> >> > To stderr, maybe.  Across an SSL-encrypted client connection?  You're
> >> > dreaming.
> >>
> >> libpq invents an equivalent message when the server closes the
> >> connection anyway, IIRC. So that'd not necessarily be too bad.
> >
> > Oh, also: It looks like it'd actually be relatively easy to give openssl
> > its own memory allocator + pool:
> > Create a global 'openssl' memory context with preallocation, use
> > CRYPTO_set_mem_functions() to make openssl allocations go through small
> > wrapper functions around palloc/repalloc/pfree.
> >
> > It's still not entirely kosher to call into openssl from a signal
> > handler because we could be inside openssl - but the window for that is
> > a lot smaller than being inside *any* memory allocation.
> 
> Can't we use a more traditional signal handling style to defer
> execution here?  1. Set an in_OpenSSL flag whenever you're about to
> enter OpenSSL.  2.  In the SIGQUIT handler, if you find that flag is
> set, then just set got_SIGQUIT and return.  3.  After you leave the
> OpenSSL code, if you see got_SIGQUIT, then run the handler explicitly.

Well, the question is if that'd ever have us defer killing the process
for longer. quickdie is intended to actually die quickly.  If there's
rekeying, a large COPY incoming, or something like that it could take a
bit. Aren't there also still places where we do blocking network IO?

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> Well, the question is if that'd ever have us defer killing the process
> for longer. quickdie is intended to actually die quickly.

Yeah.  Though now that we have the postmaster mechanism to wait-five-
seconds-then-SIGKILL, maybe we could rethink that requirement?  If we
reimplemented SIGQUIT to work more like SIGTERM, it would surely be
a lot safer.  There would be cases where a stuck backend wouldn't
respond and it'd eventually get SIGKILL'd, but in return we'd get rid
of problems like this one.

On balance I'm not sure whether that solution is more or less user-
friendly than the current setup, but for sure it'd be a lot easier
to reason about.

            regards, tom lane


Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

From
Andres Freund
Date:
On 2018-05-08 18:04:07 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Well, the question is if that'd ever have us defer killing the process
> > for longer. quickdie is intended to actually die quickly.
> 
> Yeah.  Though now that we have the postmaster mechanism to wait-five-
> seconds-then-SIGKILL, maybe we could rethink that requirement?  If we
> reimplemented SIGQUIT to work more like SIGTERM, it would surely be
> a lot safer.  There would be cases where a stuck backend wouldn't
> respond and it'd eventually get SIGKILL'd, but in return we'd get rid
> of problems like this one.

Right now we intentionally do not accept interrupts in a couple places
where we want to die quickly because we're making persistent changes. I
don't think it'd be good to continue e.g. committing any longer than
possible after one process segfaulted.  One counter-argument to that is
that the timing right now is far from synchronous either.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-05-08 18:04:07 -0400, Tom Lane wrote:
>> Yeah.  Though now that we have the postmaster mechanism to wait-five-
>> seconds-then-SIGKILL, maybe we could rethink that requirement?  If we
>> reimplemented SIGQUIT to work more like SIGTERM, it would surely be
>> a lot safer.  There would be cases where a stuck backend wouldn't
>> respond and it'd eventually get SIGKILL'd, but in return we'd get rid
>> of problems like this one.

> Right now we intentionally do not accept interrupts in a couple places
> where we want to die quickly because we're making persistent changes. I
> don't think it'd be good to continue e.g. committing any longer than
> possible after one process segfaulted.

Well, that's an implementation question.  I was sort of envisioning that
we could allow CHECK_FOR_INTERRUPTS to respond to a SIGQUIT interrupt
even when regular interrupts are disabled, reasoning that if we're at
a CHECK_FOR_INTERRUPTS at all, then we should be someplace where it's
more or less safe to trigger the exit.  If we had it like that, then
for example the commit sequence could put a CHECK_FOR_INTERRUPTS at
the last possible moment before committing, knowing that regular
interrupts are held off --- but if there's a SIGQUIT pending we'd
take it.

            regards, tom lane


It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context.  I think it'd be ok to not to translate in
that situation.

I wrote a patch to improve this problem.



zeng wenjing

------------------ Original ------------------
From:  "Andres Freund"<andres@anarazel.de>;
Date:  Tue, May 8, 2018 01:57 PM
To:  "Tom Lane"<tgl@sss.pgh.pa.us>;
Cc:  "158306855"<anderson2013@qq.com>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject:  Re: [BUG] Re-entering malloc problem when use --enable-nls buildpostgresql

On 2018-05-08 01:32:33 -0400, Tom Lane wrote:
> "158306855" <anderson2013@qq.com> writes:
> > I found that compiling postgresql with enable-nls may be introduce a problem
>
> > 1. When build postgresql with enable-nls (Native Language Support), postgresql use dgettext function to translate Language.
> > 2. The quickdie use dgettext translate message ; dgettext use malloc (in __dcigettext  function)
> > 3. When use pg_ctl -m fast to shutdown postgresql, pg backend process use function quickdie to shutdown database.
> > 4. Before receive quickdie signal, if backend process in malloc function and already have lock that will lead to process deadlock.

I don't think it's realistic to treat this is something we'll
necessarily backpatch. Any fix is going to be too complicated.


> I can't get excited about this.  quickdie's attempt to report that it's
> killing the process is necessarily a "best effort" undertaking, because
> we cannot be sure that the process is in a good state.  In this situation,
> it isn't.  --enable-nls might make the odds of that a bit worse, but we
> could get such a failure regardless.

As previously, I disagree with this. There's plenty ways to reach
quickdie, several where we are quite sure about the state. -m immediate
is a thing, and there's plenty situations where it's an entirely
reasonable choice.


> There are not any better alternatives.  We can't just set a flag in the
> signal handler and hope that control will someday reach a place that
> notices the flag.  We could exit without attempting to report anything,
> but nobody would find that user-friendly.  So we try to report, in the
> full understanding that sometimes it won't work.

It'd be fairly unproblematic to write an untranslated message out. There
we can make sure to either only use plain syscalls or use memory from
the preallocated context.  I think it'd be ok to not to translate in
that situation.

We should also use _exit or the like when exiting quickly, calling exit
handlers from a signal handlers is a bad bad idea.

Greetings,

Andres Freund
Attachment

Re: [BUG] Re-entering malloc problem when use --enable-nlsbuildpostgresql

From
Andres Freund
Date:
On 2018-05-21 10:49:53 +0800, 158306855 wrote:
> > It'd be fairly unproblematic to write an untranslated message out. There
> > we can make sure to either only use plain syscalls or use memory from
> > the preallocated context.  I think it'd be ok to not to translate in
> > that situation.
> 
> 
> I wrote a patch to improve this problem.
>  
> +extern int    errmsg_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern int    errdetail_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern int    errhint_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +
>  /*
>   * errcontext() is typically called in error context callback functions, not
>   * within an ereport() invocation. The callback function can be in a different

Can't we just reuse errmsg_internal etc?

Greetings,

Andres Freund


> Can't we just reuse errmsg_internal etc?

Yes, the *_internal function can be reused. This is my modified patch.
Please take a look.


Zeng Wenjing

------------------ Original ------------------
From:  "Andres Freund"<andres@anarazel.de>;
Date:  Wed, May 23, 2018 05:11 AM
To:  "158306855"<anderson2013@qq.com>;
Cc:  "Tom Lane"<tgl@sss.pgh.pa.us>; "pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject:  Re: [BUG] Re-entering malloc problem when use --enable-nlsbuildpostgresql

On 2018-05-21 10:49:53 +0800, 158306855 wrote:
> > It'd be fairly unproblematic to write an untranslated message out. There
> > we can make sure to either only use plain syscalls or use memory from
> > the preallocated context.  I think it'd be ok to not to translate in
> > that situation.
>
>
> I wrote a patch to improve this problem.

> +extern int errmsg_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern int errdetail_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +extern int errhint_no_translateit(const char *fmt,...) pg_attribute_printf(1, 2);
> +
>  /*
>   * errcontext() is typically called in error context callback functions, not
>   * within an ereport() invocation. The callback function can be in a different

Can't we just reuse errmsg_internal etc?

Greetings,

Andres Freund

Attachment