Thread: bug in SignalSomeChildren

bug in SignalSomeChildren

From
Fujii Masao
Date:
Hi,

I found a bug which always prevents SignalSomeChildren with
BACKEND_TYPE_WALSND from sending a signal to walsender.

Though currently SignalSomeChildren with BACKEND_TYPE_WALSND
has not been called anywhere, it's not hard to believe that will
be called in the future. So we should apply the following change.

----------------------
diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 6f934ee..2d86fb6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3162,7 +3162,8 @@ SignalSomeChildren(int signal, int target)
               if (bp->dead_end)                       continue;
-               if (!(target & BACKEND_TYPE_NORMAL) && !bp->is_autovacuum)
+               if (!(target & BACKEND_TYPE_NORMAL) && !bp->is_autovacuum &&
+                       !IsPostmasterChildWalSender(bp->child_slot))                       continue;               if
(!(target& BACKEND_TYPE_AUTOVAC) && bp->is_autovacuum)                       continue;
 
----------------------

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Tue, Dec 14, 2010 at 10:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> I found a bug which always prevents SignalSomeChildren with
> BACKEND_TYPE_WALSND from sending a signal to walsender.
>
> Though currently SignalSomeChildren with BACKEND_TYPE_WALSND
> has not been called anywhere, it's not hard to believe that will
> be called in the future. So we should apply the following change.

I think the attached might be a little tidier.  Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: bug in SignalSomeChildren

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of vie dic 17 10:08:04 -0300 2010:
> On Tue, Dec 14, 2010 at 10:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > I found a bug which always prevents SignalSomeChildren with
> > BACKEND_TYPE_WALSND from sending a signal to walsender.
> >
> > Though currently SignalSomeChildren with BACKEND_TYPE_WALSND
> > has not been called anywhere, it's not hard to believe that will
> > be called in the future. So we should apply the following change.
> 
> I think the attached might be a little tidier.  Thoughts?

Yeah, that looks more readable to me.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think the attached might be a little tidier.  Thoughts?

I'm not really thrilled at the idea of calling
IsPostmasterChildWalSender for every child whether or not it will have
any impact on the decision.  That involves touching shared memory which
can be rather expensive (see previous discussions about shared cache
lines and so forth).

The existing coding is pretty ugly, I agree.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Fri, Dec 17, 2010 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think the attached might be a little tidier.  Thoughts?
>
> I'm not really thrilled at the idea of calling
> IsPostmasterChildWalSender for every child whether or not it will have
> any impact on the decision.  That involves touching shared memory which
> can be rather expensive (see previous discussions about shared cache
> lines and so forth).

The existing code already does that, unless I'm missing something.  We
could improve on my proposed patch a bit by doing the is_autovacuum
test first and the walsender test second.  I'm not sure how to improve
on it beyond that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of vie dic 17 12:27:30 -0300 2010:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I think the attached might be a little tidier.  Thoughts?
> 
> I'm not really thrilled at the idea of calling
> IsPostmasterChildWalSender for every child whether or not it will have
> any impact on the decision.  That involves touching shared memory which
> can be rather expensive (see previous discussions about shared cache
> lines and so forth).

Is it possible to save the "is walsender" flag in the Backend struct?
That would make it possible to solve the problem very easily.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Dec 17, 2010 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm not really thrilled at the idea of calling
>> IsPostmasterChildWalSender for every child whether or not it will have
>> any impact on the decision. �That involves touching shared memory which
>> can be rather expensive (see previous discussions about shared cache
>> lines and so forth).

> The existing code already does that, unless I'm missing something.

No, it only makes that call when it's actually relevant to the decision
and it's exhausted all other possible tests.  In particular, the call is
never made for target = ALL which I think is the common case.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Is it possible to save the "is walsender" flag in the Backend struct?
> That would make it possible to solve the problem very easily.

Yeah, I was wondering about that too, but the problem is that the
postmaster doesn't know that at the time it forks the child.  The
flag in shared memory will get set later, but it's hard to tell
how much later.

Of course, that observation also means that anyplace the postmaster
tries to distinguish walsenders from other children is fundamentally
broken anyhow: a walsender that hasn't set the flag yet will get
treated like a regular backend.

I think what we ought to be looking to do is get rid of the distinction,
so that the postmaster treats walsenders the same as other children.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Fri, Dec 17, 2010 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think what we ought to be looking to do is get rid of the distinction,
> so that the postmaster treats walsenders the same as other children.

It's not apparent to me that the existing places where postmaster.c
makes that distinction are in fact correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Alvaro Herrera
Date:
Excerpts from Fujii Masao's message of mié dic 15 00:54:39 -0300 2010:
> Hi,
> 
> I found a bug which always prevents SignalSomeChildren with
> BACKEND_TYPE_WALSND from sending a signal to walsender.
> 
> Though currently SignalSomeChildren with BACKEND_TYPE_WALSND
> has not been called anywhere, it's not hard to believe that will
> be called in the future. So we should apply the following change.

BTW doesn't CountChildren have the same problem?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in SignalSomeChildren

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of vie dic 17 13:18:35 -0300 2010:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Is it possible to save the "is walsender" flag in the Backend struct?
> > That would make it possible to solve the problem very easily.
> 
> Yeah, I was wondering about that too, but the problem is that the
> postmaster doesn't know that at the time it forks the child.  The
> flag in shared memory will get set later, but it's hard to tell
> how much later.

Yeah, I arrived at the same conclusion.  I was wondering if we could
cache the result of the shared mem lookup the first time it was done,
but as you say there would still be a race condition.

> I think what we ought to be looking to do is get rid of the distinction,
> so that the postmaster treats walsenders the same as other children.

I think the problem with this is that walsenders are treated in a very
special way during shutdown -- they need to stay up until after bgwriter
is gone.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Tom Lane's message of vie dic 17 13:18:35 -0300 2010:
>> I think what we ought to be looking to do is get rid of the distinction,
>> so that the postmaster treats walsenders the same as other children.

> I think the problem with this is that walsenders are treated in a very
> special way during shutdown -- they need to stay up until after bgwriter
> is gone.

Why do they need to survive the bgwriter?  And more to the point, why
does that logic need to be implemented on the postmaster side?  Since
only the child process really knows reliably whether it's a walsender,
it'd be far safer if the behavioral difference could be handled on the
child side.  I haven't looked at the details, but I'm wondering if we
couldn't make this go by having walsender children react differently
to the same signals the postmaster sends other children.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Fri, Dec 17, 2010 at 11:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Excerpts from Tom Lane's message of vie dic 17 13:18:35 -0300 2010:
>>> I think what we ought to be looking to do is get rid of the distinction,
>>> so that the postmaster treats walsenders the same as other children.
>
>> I think the problem with this is that walsenders are treated in a very
>> special way during shutdown -- they need to stay up until after bgwriter
>> is gone.
>
> Why do they need to survive the bgwriter?  And more to the point, why
> does that logic need to be implemented on the postmaster side?  Since
> only the child process really knows reliably whether it's a walsender,
> it'd be far safer if the behavioral difference could be handled on the
> child side.  I haven't looked at the details, but I'm wondering if we
> couldn't make this go by having walsender children react differently
> to the same signals the postmaster sends other children.

I'm not too sure we're shutting down the WAL senders right now.  I
think they may just be exiting on their own when the postmaster goes
away.
               /*                * Emergency bailout if postmaster has died.  This is
to avoid the                * necessity for manual cleanup of all postmaster children.                */
if(!PostmasterIsAlive(true))                       exit(1); 

I'm having a bit of trouble confirming this on MacOS X, though.
Attaching gdb to either the startup process or a WAL sender causes
PostmasterIsAlive to return false, resulting in a near-immediate exit.Seems pretty stupid for attaching gdb to change
thereturn value of 
getppid() but it seems like that must be what's happening.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Heikki Linnakangas
Date:
On 17.12.2010 19:08, Robert Haas wrote:
> On Fri, Dec 17, 2010 at 11:43 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Alvaro Herrera<alvherre@commandprompt.com>  writes:
>>> Excerpts from Tom Lane's message of vie dic 17 13:18:35 -0300 2010:
>>>> I think what we ought to be looking to do is get rid of the distinction,
>>>> so that the postmaster treats walsenders the same as other children.
>>
>>> I think the problem with this is that walsenders are treated in a very
>>> special way during shutdown -- they need to stay up until after bgwriter
>>> is gone.
>>
>> Why do they need to survive the bgwriter?

Because we want the shutdown checkpoint record that bgwriter writes just 
before it dies to be replicated to the standbys.

>>  And more to the point, why
>> does that logic need to be implemented on the postmaster side?  Since
>> only the child process really knows reliably whether it's a walsender,
>> it'd be far safer if the behavioral difference could be handled on the
>> child side.  I haven't looked at the details, but I'm wondering if we
>> couldn't make this go by having walsender children react differently
>> to the same signals the postmaster sends other children.
>
> I'm not too sure we're shutting down the WAL senders right now.

Sure we do. postmaster sends walsenders SIGUSR2 when bgwriter dies. When 
a walsender receives SIGUSR2, it tries to send all pending WAL, and 
terminates after that. The postmaster goes into PM_SHUTDOWN_2 state, 
waiting for all the walsenders and the archiver process to die.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: bug in SignalSomeChildren

From
Fujii Masao
Date:
On Sat, Dec 18, 2010 at 1:24 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Fujii Masao's message of mié dic 15 00:54:39 -0300 2010:
>> Hi,
>>
>> I found a bug which always prevents SignalSomeChildren with
>> BACKEND_TYPE_WALSND from sending a signal to walsender.
>>
>> Though currently SignalSomeChildren with BACKEND_TYPE_WALSND
>> has not been called anywhere, it's not hard to believe that will
>> be called in the future. So we should apply the following change.
>
> BTW doesn't CountChildren have the same problem?

Oh, yes. We should fix that, too.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: bug in SignalSomeChildren

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of vie dic 17 14:08:04 -0300 2010:

> I'm having a bit of trouble confirming this on MacOS X, though.
> Attaching gdb to either the startup process or a WAL sender causes
> PostmasterIsAlive to return false, resulting in a near-immediate exit.
>  Seems pretty stupid for attaching gdb to change the return value of
> getppid() but it seems like that must be what's happening.

Yeah, this problem has been known for some time and causes quite some
pain.  We have an open problem report on autovacuum failing to run after
some time, and we haven't been able to get a backtrace or strace because
of this issue -- trying to attach to it causes a full system restart
(PostmasterIsAlive returns false, autovac launcher dies, this death
causes postmaster to shut everything down and restart).  This is on
FreeBSD.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> ...  We have an open problem report on autovacuum failing to run after
> some time, and we haven't been able to get a backtrace or strace because
> of this issue ...

I wonder whether that's the already-fixed problem with autovacuum cost
limit going to zero in long-lived workers.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 8:20 AM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Excerpts from Robert Haas's message of vie dic 17 14:08:04 -0300 2010:
>
>> I'm having a bit of trouble confirming this on MacOS X, though.
>> Attaching gdb to either the startup process or a WAL sender causes
>> PostmasterIsAlive to return false, resulting in a near-immediate exit.
>>  Seems pretty stupid for attaching gdb to change the return value of
>> getppid() but it seems like that must be what's happening.
>
> Yeah, this problem has been known for some time and causes quite some
> pain.  We have an open problem report on autovacuum failing to run after
> some time, and we haven't been able to get a backtrace or strace because
> of this issue -- trying to attach to it causes a full system restart
> (PostmasterIsAlive returns false, autovac launcher dies, this death
> causes postmaster to shut everything down and restart).  This is on
> FreeBSD.

Can we add a develop option to force use of the kill(0) method?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
>>> Attaching gdb to either the startup process or a WAL sender causes
>>> PostmasterIsAlive to return false, resulting in a near-immediate exit.
>>> �Seems pretty stupid for attaching gdb to change the return value of
>>> getppid() but it seems like that must be what's happening.

> Can we add a develop option to force use of the kill(0) method?

How will that avoid needing to have an honest answer from getppid()?
Without that you can't know what to issue kill() against.

Seems like the correct path here is to complain to gdb and/or BSD
upstreams about this misbehavior.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 1:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>>> Attaching gdb to either the startup process or a WAL sender causes
>>>> PostmasterIsAlive to return false, resulting in a near-immediate exit.
>>>>  Seems pretty stupid for attaching gdb to change the return value of
>>>> getppid() but it seems like that must be what's happening.
>
>> Can we add a develop option to force use of the kill(0) method?
>
> How will that avoid needing to have an honest answer from getppid()?
> Without that you can't know what to issue kill() against.

The answer to this question will probably be entirely self-evident if
you stare at PostmasterIsAlive() for, well, it took me about 10
seconds.  So probably less than five for you.

> Seems like the correct path here is to complain to gdb and/or BSD
> upstreams about this misbehavior.

That might be a good thing to do too, but even if they agree to fix it
and do in fact fix it right away, it'll take months or years before
all of the major PostgreSQL contributors can benefit from those fixes,
as opposed to, say, this afternoon.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Dec 20, 2010 at 1:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Can we add a develop option to force use of the kill(0) method?

>> How will that avoid needing to have an honest answer from getppid()?
>> Without that you can't know what to issue kill() against.

> The answer to this question will probably be entirely self-evident if
> you stare at PostmasterIsAlive() for, well, it took me about 10
> seconds.  So probably less than five for you.

Hmm, I was thinking that PostmasterPid was set originally from getppid,
but it looks like we rely on inheriting it through fork instead.
So maybe this will work.  It's still slower and less reliable than the
getppid case for normal use, though.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Dec 20, 2010 at 1:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> Can we add a develop option to force use of the kill(0) method?
>
>>> How will that avoid needing to have an honest answer from getppid()?
>>> Without that you can't know what to issue kill() against.
>
>> The answer to this question will probably be entirely self-evident if
>> you stare at PostmasterIsAlive() for, well, it took me about 10
>> seconds.  So probably less than five for you.
>
> Hmm, I was thinking that PostmasterPid was set originally from getppid,
> but it looks like we rely on inheriting it through fork instead.
> So maybe this will work.  It's still slower and less reliable than the
> getppid case for normal use, though.

Well, that's why it'd be a developer option, rather than the default
behavior.  If we can agree on a name I'll work up a patch.
Bikeshedding in 3...  2...  1...

check_postmaster_via_kill
avoid_backend_getppid
...?

Another option that might be workable (but I have reservations, and
haven't tested it either) is to check whether the return value of
getppid() is equal to 1.  If it's neither 1 nor PostmasterPid then try
kill().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Another option that might be workable (but I have reservations, and
> haven't tested it either) is to check whether the return value of
> getppid() is equal to 1.  If it's neither 1 nor PostmasterPid then try
> kill().

I like that better actually ... one less thing for developers to get wrong.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Another option that might be workable (but I have reservations, and
>> haven't tested it either) is to check whether the return value of
>> getppid() is equal to 1.  If it's neither 1 nor PostmasterPid then try
>> kill().
>
> I like that better actually ... one less thing for developers to get wrong.

The attached patch appears to work correctly on MacOS X.  I did check,
BTW: getppid() in the attached process returns gdb's pid.  Poor!

For my own purposes, I would be just as happy to apply this only to
master.  But I wonder if anyone wants to argue for back-patching, to
help debug existing installations?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Dec 20, 2010 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I like that better actually ... one less thing for developers to get wrong.

> The attached patch appears to work correctly on MacOS X.  I did check,
> BTW: getppid() in the attached process returns gdb's pid.  Poor!

Looks good to me.

> For my own purposes, I would be just as happy to apply this only to
> master.  But I wonder if anyone wants to argue for back-patching, to
> help debug existing installations?

Given the lack of non-developer complaints, I see no need to backpatch.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Dec 20, 2010 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I like that better actually ... one less thing for developers to get wrong.
>
>> The attached patch appears to work correctly on MacOS X.  I did check,
>> BTW: getppid() in the attached process returns gdb's pid.  Poor!
>
> Looks good to me.
>
>> For my own purposes, I would be just as happy to apply this only to
>> master.  But I wonder if anyone wants to argue for back-patching, to
>> help debug existing installations?
>
> Given the lack of non-developer complaints, I see no need to backpatch.

Well, non-developers don't tend to attach gdb very often.  Alvaro
mentioned a problem installation upthread, thus the question.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Martijn van Oosterhout
Date:
On Mon, Dec 20, 2010 at 03:08:02PM -0500, Robert Haas wrote:
> The attached patch appears to work correctly on MacOS X.  I did check,
> BTW: getppid() in the attached process returns gdb's pid.  Poor!

This appears to be a BSDism at least. On Linux and BSD derivatives the
man pages specifically mention the reparenting (needed for catching
signals) but on Linux getppid() is specifically documented to return
the correct value anyway.

Frankly it's a wart, for example strace/truss/whatever could (since
it's tracing anyway) just fudge the correct value in the getppid() call
so the userspace process doesn't notice. This has been a bug since
forever though, so I wouldn't hold my breath.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
>                                       - Charles de Gaulle

Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 3:36 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> Frankly it's a wart, for example strace/truss/whatever could (since
> it's tracing anyway) just fudge the correct value in the getppid() call
> so the userspace process doesn't notice. This has been a bug since
> forever though, so I wouldn't hold my breath.

Yeah, I have a vague recollection of noticing this in some other
context ~10 years ago.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Mon, Dec 20, 2010 at 3:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 20, 2010 at 3:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Mon, Dec 20, 2010 at 2:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I like that better actually ... one less thing for developers to get wrong.
>>
>>> The attached patch appears to work correctly on MacOS X.  I did check,
>>> BTW: getppid() in the attached process returns gdb's pid.  Poor!
>>
>> Looks good to me.
>>
>>> For my own purposes, I would be just as happy to apply this only to
>>> master.  But I wonder if anyone wants to argue for back-patching, to
>>> help debug existing installations?
>>
>> Given the lack of non-developer complaints, I see no need to backpatch.
>
> Well, non-developers don't tend to attach gdb very often.  Alvaro
> mentioned a problem installation upthread, thus the question.

Hearing no cries of "please-oh-please-backpatch-this", I've committed
it just to master.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Eric Ridge
Date:
On Mon, Dec 20, 2010 at 3:36 PM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> On Mon, Dec 20, 2010 at 03:08:02PM -0500, Robert Haas wrote:
>> The attached patch appears to work correctly on MacOS X.  I did check,
>> BTW: getppid() in the attached process returns gdb's pid.  Poor!
>
> This appears to be a BSDism at least. On Linux and BSD derivatives the
> man pages specifically mention the reparenting (needed for catching
> signals) but on Linux getppid() is specifically documented to return
> the correct value anyway.

I'm just a random lurker here, and happened to catch the last bit of
this thread.  Could one of you that understand this issue straighten
something out for me?

Every now and again we've been known to attach gdb to a production
Postgres backend to troubleshoot problems.  Ya know, just trying to
get an idea of what Postgres is actually doing via a backtrace.  This
is always on Linux, BTW.

Does this thread mean that the above no longer works with v9?  Or is
this only on non-Linux systems, or did the patch Robert Haas commit
"fix" fix?  We're still using 8.1 (slowly moving to 8.4) in
production, but have plans of picking up 9.x later in '11.  Just
wondering if we need to actually be a bit more careful in the future?

Thanks!

eric


Re: bug in SignalSomeChildren

From
Robert Haas
Date:
On Tue, Dec 21, 2010 at 1:45 PM, Eric Ridge <eebbrr@gmail.com> wrote:
> On Mon, Dec 20, 2010 at 3:36 PM, Martijn van Oosterhout
> <kleptog@svana.org> wrote:
>> On Mon, Dec 20, 2010 at 03:08:02PM -0500, Robert Haas wrote:
>>> The attached patch appears to work correctly on MacOS X.  I did check,
>>> BTW: getppid() in the attached process returns gdb's pid.  Poor!
>>
>> This appears to be a BSDism at least. On Linux and BSD derivatives the
>> man pages specifically mention the reparenting (needed for catching
>> signals) but on Linux getppid() is specifically documented to return
>> the correct value anyway.
>
> I'm just a random lurker here, and happened to catch the last bit of
> this thread.  Could one of you that understand this issue straighten
> something out for me?
>
> Every now and again we've been known to attach gdb to a production
> Postgres backend to troubleshoot problems.  Ya know, just trying to
> get an idea of what Postgres is actually doing via a backtrace.  This
> is always on Linux, BTW.
>
> Does this thread mean that the above no longer works with v9?  Or is
> this only on non-Linux systems, or did the patch Robert Haas commit
> "fix" fix?  We're still using 8.1 (slowly moving to 8.4) in
> production, but have plans of picking up 9.x later in '11.  Just
> wondering if we need to actually be a bit more careful in the future?

The point of the patch was to improve cases where attaching gdb
*didn't* work well.  Any cases where it was already working for you
aren't going to be made worse by this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: bug in SignalSomeChildren

From
Eric Ridge
Date:
On Tue, Dec 21, 2010 at 2:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The point of the patch was to improve cases where attaching gdb
> *didn't* work well.  Any cases where it was already working for you
> aren't going to be made worse by this.

Okay, great.  Thanks for the clarification.

eric


Re: bug in SignalSomeChildren

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of mar dic 21 08:40:49 -0300 2010:

> > Well, non-developers don't tend to attach gdb very often.  Alvaro
> > mentioned a problem installation upthread, thus the question.
> 
> Hearing no cries of "please-oh-please-backpatch-this", I've committed
> it just to master.

Please-oh-please backpatch this ... at least to 8.4.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: bug in SignalSomeChildren

From
Fujii Masao
Date:
On Sat, Dec 18, 2010 at 1:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 17, 2010 at 10:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> I think the attached might be a little tidier.  Thoughts?
>>
>> I'm not really thrilled at the idea of calling
>> IsPostmasterChildWalSender for every child whether or not it will have
>> any impact on the decision.  That involves touching shared memory which
>> can be rather expensive (see previous discussions about shared cache
>> lines and so forth).
>
> The existing code already does that, unless I'm missing something.  We
> could improve on my proposed patch a bit by doing the is_autovacuum
> test first and the walsender test second.  I'm not sure how to improve
> on it beyond that.

How about doing target != ALL test at the head for the most common case
(target == ALL)? I added that test into your patch and changed it so that the
is_autovacuum test is done first.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: bug in SignalSomeChildren

From
Tom Lane
Date:
Fujii Masao <masao.fujii@gmail.com> writes:
> How about doing target != ALL test at the head for the most common case
> (target == ALL)?

That's an idea, but the test you propose implements it incorrectly.
        regards, tom lane


Re: bug in SignalSomeChildren

From
Fujii Masao
Date:
On Wed, Dec 22, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fujii Masao <masao.fujii@gmail.com> writes:
>> How about doing target != ALL test at the head for the most common case
>> (target == ALL)?
>
> That's an idea, but the test you propose implements it incorrectly.

Thanks! I revised the patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: bug in SignalSomeChildren

From
Bernd Helmle
Date:

--On 22. Dezember 2010 15:51:09 +0900 Fujii Masao <masao.fujii@gmail.com> 
wrote:

>>> How about doing target != ALL test at the head for the most common case
>>> (target == ALL)?
>>
>> That's an idea, but the test you propose implements it incorrectly.
>
> Thanks! I revised the patch.

I had a look at this for the current CF and the patch looks reasonable to 
me. Some testing shows that the changes are working as intended (at least, 
the wal sender actually receives now signals from SignalSomeChildren() as 
far as the DEBUG4 output shows). Maybe we should put in a small comment, 
why we special case BACKEND_TYPE_ALL (following Tom's comment about 
expensive shared memory access and IsPostmasterChildWalSender()). I marked 
it as "Ready for Committer".

Question for my understanding:

While reading the small patch, i realized that there's no 
BACKEND_TYPE_WALRECV or similar. If i understand correctly there's no need 
to handle it this way, since there's only one wal receiver process per 
instance?

-- 
Thanks
Bernd


Re: bug in SignalSomeChildren

From
Fujii Masao
Date:
On Thu, Jan 20, 2011 at 9:53 PM, Bernd Helmle <mailings@oopsware.de> wrote:
> I had a look at this for the current CF and the patch looks reasonable to
> me. Some testing shows that the changes are working as intended (at least,
> the wal sender actually receives now signals from SignalSomeChildren() as
> far as the DEBUG4 output shows).

Thanks for the review and test!

> Maybe we should put in a small comment, why
> we special case BACKEND_TYPE_ALL (following Tom's comment about expensive
> shared memory access and IsPostmasterChildWalSender()).

I added the following small comment. Patch attached.

+        /*
+         * Since target == BACKEND_TYPE_ALL is the most common case,
+         * we test it first and avoid touching shared memory for
+         * every child.
+         */

> Question for my understanding:
>
> While reading the small patch, i realized that there's no
> BACKEND_TYPE_WALRECV or similar. If i understand correctly there's no need
> to handle it this way, since there's only one wal receiver process per
> instance?

Yes. But also that's because walreceiver is an auxiliary process (like
bgwriter and
walwriter ..etc) but not a backend.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment