Thread: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
> Hamid@EDB; Can you please have someone configure anole to build git
> head as well as the other branches? Thanks.

The test_shm_mq regression tests hung on this machine this morning.
Hamid was able to give me access to log in and troubleshoot.
Unfortunately, I wasn't able to completely track down the problem
before accidentally killing off the running cluster, but it looks like
test_shm_mq_pipelined() tried to start 3 background workers and the
postmaster only ever launched one of them, so the test just sat there
and waited for the other two workers to start.  At this point, I have
no idea what could cause the postmaster to be asleep at the switch
like this, but it seems clear that's what happened.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> The test_shm_mq regression tests hung on this machine this morning.

It looks like hamster may have a repeatable issue there as well,
since the last set of DSM code changes.
        regards, tom lane



I wrote:
> It looks like hamster may have a repeatable issue there as well,
> since the last set of DSM code changes.

Ah, scratch that --- on closer inspection it looks like both failures
probably trace to out-of-disk-space.
        regards, tom lane



On Sat, May 10, 2014 at 6:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The test_shm_mq regression tests hung on this machine this morning.
>
> It looks like hamster may have a repeatable issue there as well,
> since the last set of DSM code changes.
Yeah, this node has a limited amount of space available as it runs
only with a 4GB flash card... I just freed up ~200MB of cached
packages on it; let's hope that the out-of-space error becomes less
frequent when building the branches. What is interesting, btw, is that it
only happens for a couple of contrib tests (pgcrypto, test_shm_mq),
and only on the master branch.
-- 
Michael



On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
>> Hamid@EDB; Can you please have someone configure anole to build git
>> head as well as the other branches? Thanks.
>
> The test_shm_mq regression tests hung on this machine this morning.
> Hamid was able to give me access to log in and troubleshoot.
> Unfortunately, I wasn't able to completely track down the problem
> before accidentally killing off the running cluster, but it looks like
> test_shm_mq_pipelined() tried to start 3 background workers and the
> postmaster only ever launched one of them, so the test just sat there
> and waited for the other two workers to start.  At this point, I have
> no idea what could cause the postmaster to be asleep at the switch
> like this, but it seems clear that's what happened.

This happened again, and I investigated further.  It looks like the
postmaster knows full well that it's supposed to start more bgworkers:
the ones that never get started are in the postmaster's
BackgroundWorkerList, and StartWorkerNeeded is true.  But it only
starts the first one, not all three.  Why?

Here's my theory.  When I did a backtrace inside the postmaster, it
was stuck inside select(), within ServerLoop().  I think that's
just where it was when the backend that wanted to run test_shm_mq
requested that a few background workers get launched.  Each
registration would have sent the postmaster a separate SIGUSR1, but
for some reason the postmaster only received one, which I think is
legit behavior, though possibly not typical on modern Linux systems.
When the SIGUSR1 arrived, the postmaster jumped into
sigusr1_handler().  sigusr1_handler() calls maybe_start_bgworker(),
which launched the first background worker.  Then it returned, and the
arrival of the signal did NOT interrupt the pending select().

This chain of events can't occur if an arriving SIGUSR1 causes
select() to return EINTR or EWOULDBLOCK, nor can it happen if the
signal handler is entered three separate times, once for each SIGUSR1.
That combination of explanations seems likely sufficient to explain
why this doesn't occur on other machines.
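
To spell out the coalescing part, here is a tiny standalone program
(made-up demo, nothing from our tree) that blocks SIGUSR1, sends it to
itself three times, and then sleeps in select(); the handler runs only
once and the select() sleeps through its full timeout:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t handler_calls = 0;

static void
sigusr1_handler(int signo)
{
    handler_calls++;            /* the postmaster would start ONE worker here */
}

int
main(void)
{
    struct sigaction sa;
    sigset_t        blockmask, origmask;
    struct timeval  tv = {2, 0};

    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = sigusr1_handler;
    sigaction(SIGUSR1, &sa, NULL);

    /* Block SIGUSR1 and send it three times; plain signals don't queue. */
    sigemptyset(&blockmask);
    sigaddset(&blockmask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &blockmask, &origmask);
    kill(getpid(), SIGUSR1);
    kill(getpid(), SIGUSR1);
    kill(getpid(), SIGUSR1);
    sigprocmask(SIG_SETMASK, &origmask, NULL);  /* handler fires once, right here */

    /* The other "requests" are gone; nothing interrupts this sleep. */
    select(0, NULL, NULL, NULL, &tv);
    printf("handler ran %d time(s)\n", (int) handler_calls);
    return 0;
}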

The code seems to have been this way since the commit that introduced
background workers (da07a1e856511dca59cbb1357616e26baa64428e),
although the function was called StartOneBackgroundWorker back then.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
> >> Hamid@EDB; Can you please have someone configure anole to build git
> >> head as well as the other branches? Thanks.
> >
> > The test_shm_mq regression tests hung on this machine this morning.
> > Hamid was able to give me access to log in and troubleshoot.
> > Unfortunately, I wasn't able to completely track down the problem
> > before accidentally killing off the running cluster, but it looks like
> > test_shm_mq_pipelined() tried to start 3 background workers and the
> > postmaster only ever launched one of them, so the test just sat there
> > and waited for the other two workers to start.  At this point, I have
> > no idea what could cause the postmaster to be asleep at the switch
> > like this, but it seems clear that's what happened.
> 
> This happened again, and I investigated further.  It looks like the
> postmaster knows full well that it's supposed to start more bgworkers:
> the ones that never get started are in the postmaster's
> BackgroundWorkerList, and StartWorkerNeeded is true.  But it only
> starts the first one, not all three.  Why?
> 
> Here's my theory.  When I did a backtrace inside the postmaster, it
> was stuck inside select(), within ServerLoop().  I think that's
> just where it was when the backend that wanted to run test_shm_mq
> requested that a few background workers get launched.  Each
> registration would have sent the postmaster a separate SIGUSR1, but
> for some reason the postmaster only received one, which I think is
> legit behavior, though possibly not typical on modern Linux systems.
> When the SIGUSR1 arrived, the postmaster jumped into
> sigusr1_handler().  sigusr1_handler() calls maybe_start_bgworker(),
> which launched the first background worker.  Then it returned, and the
> arrival of the signal did NOT interrupt the pending select().
> 
> This chain of events can't occur if an arriving SIGUSR1 causes
> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
> signal handler is entered three separate times, once for each SIGUSR1.
> That combination of explanations seems likely sufficient to explain
> why this doesn't occur on other machines.
> 
> The code seems to have been this way since the commit that introduced
> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
> although the function was called StartOneBackgroundWorker back then.

If that theory is true, wouldn't things get unstuck every time a new
connection comes in? Or after 60 seconds have passed? That's not to say this
isn't wrong, but still?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> This happened again, and I investigated further.  It looks like the
>> postmaster knows full well that it's supposed to start more bgworkers:
>> the ones that never get started are in the postmaster's
>> BackgroundWorkerList, and StartWorkerNeeded is true.  But it only
>> starts the first one, not all three.  Why?
>>
>> Here's my theory.  When I did a backtrace inside the postmaster, it
>> was stuck inside select(), within ServerLoop().  I think that's
>> just where it was when the backend that wanted to run test_shm_mq
>> requested that a few background workers get launched.  Each
>> registration would have sent the postmaster a separate SIGUSR1, but
>> for some reason the postmaster only received one, which I think is
>> legit behavior, though possibly not typical on modern Linux systems.
>> When the SIGUSR1 arrived, the postmaster jumped into
>> sigusr1_handler().  sigusr1_handler() calls maybe_start_bgworker(),
>> which launched the first background worker.  Then it returned, and the
>> arrival of the signal did NOT interrupt the pending select().
>>
>> This chain of events can't occur if an arriving SIGUSR1 causes
>> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
>> signal handler is entered three separate times, once for each SIGUSR1.
>> That combination of explanations seems likely sufficient to explain
>> why this doesn't occur on other machines.
>>
>> The code seems to have been this way since the commit that introduced
>> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
>> although the function was called StartOneBackgroundWorker back then.
>
> If that theory is true, wouldn't things get unstuck everytime a new
> connection comes in? Or 60 seconds have passed? That's not to say this
> isn't wrong, but still?

There aren't going to be any new connections arriving when running
the contrib regression tests, I believe, so I don't think there is an
escape hatch there.  I didn't think to check how the timeout was set in
ServerLoop, and it does look like the maximum ought to be 60 seconds,
so either there's some other ingredient I'm missing here, or the whole
theory is just wrong altogether.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2014-09-29 15:24:55 -0400, Robert Haas wrote:
> On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > If that theory is true, wouldn't things get unstuck everytime a new
> > connection comes in? Or 60 seconds have passed? That's not to say this
> > isn't wrong, but still?
> 
> There aren't any going to be any new connections arriving when running
> the contrib regression tests, I believe, so I don't think there is an
> escape hatch there.

I thought you might have tried connecting... And I guessed you'd have
reported it if that had fixed it.

> I didn't think to check how timeout was set in
> ServerLoop, and it does look like the maximum ought to be 60 seconds,
> so either there's some other ingredient I'm missing here, or the whole
> theory is just wrong altogether.  :-(

Yea :(. Note how signals are blocked in all the signal handlers and only
unblocked for a very short time (the sleep).

(stare at random shit for far too long)

Ah. DetermineSleepTime(), which is called while signals are unblocked!,
modifies BackgroundWorkerList. Previously that only iterated the list,
without modifying it. That's already of quite debatable safety, but
modifying it without having blocked signals is most definitely
broken. The modification was introduced by 7f7485a0c...

If you can manually run stuff on that machine, it'd be rather helpful if
you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
HaveCrashedWorker() loop.
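
To be explicit about the pattern I mean (standalone sketch with made-up
names, not a patch): any code that modifies a list the SIGUSR1 handler
also traverses has to run with the signal blocked.

#include <signal.h>

/* stand-in for the list surgery DetermineSleepTime() now does */
static void
modify_worker_list(void)
{
}

static void
modify_list_with_signals_blocked(void)
{
    sigset_t    block, save;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);

    sigprocmask(SIG_BLOCK, &block, &save);  /* roughly PG_SETMASK(&BlockSig) */
    modify_worker_list();                   /* handler can't see a half-updated list now */
    sigprocmask(SIG_SETMASK, &save, NULL);  /* roughly PG_SETMASK(&UnBlockSig) */
}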

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
> >> Hamid@EDB; Can you please have someone configure anole to build git
> >> head as well as the other branches? Thanks.
> >
> > The test_shm_mq regression tests hung on this machine this morning.
> > Hamid was able to give me access to log in and troubleshoot.
> > Unfortunately, I wasn't able to completely track down the problem
> > before accidentally killing off the running cluster, but it looks like
> > test_shm_mq_pipelined() tried to start 3 background workers and the
> > postmaster only ever launched one of them, so the test just sat there
> > and waited for the other two workers to start.  At this point, I have
> > no idea what could cause the postmaster to be asleep at the switch
> > like this, but it seems clear that's what happened.
> 
> This happened again, and I investigated further.  It looks like the
> postmaster knows full well that it's supposed to start more bgworkers:
> the ones that never get started are in the postmaster's
> BackgroundWorkerList, and StartWorkerNeeded is true.  But it only
> starts the first one, not all three.  Why?

Not necessarily related, but one interesting tidbit is that fork() isn't
listed as async-signal-safe on HP-UX:
http://nixdoc.net/man-pages/HP-UX/man5/thread_safety.5.html#Async%20Signal%20Safe

I have some doubts that fork() really could be non-signal-safe, but it's
a bit odd.  IIRC POSIX requires fork() to be async-signal-safe, at least
if threads aren't present.

I'm generally baffled at all the stuff postmaster does in signal
handlers... ProcessConfigFile(), load_hba() et al. It's all done with
signals disabled, but still.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Mon, Sep 29, 2014 at 3:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Yea :(. Note how signals are blocked in all the signal handlers and only
> unblocked for a very short time (the sleep).
>
> (stare at random shit for far too long)
>
> Ah. DetermineSleepTime(), which is called while signals are unblocked!,
> modifies BackgroundWorkerList. Previously that only iterated the list,
> without modifying it. That's already of quite debatable safety, but
> modifying it without having blocked signals is most definitely
> broken. The modification was introduced by 7f7485a0c...

Ouch.  OK, yeah, that's a bug.

> If you can manually run stuff on that machine, it'd be rather helpful if
> you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
> HaveCrashedWorker() loop.

I'd do it the other way around, and adjust ServerLoop to put the
PG_SETMASK calls right around pg_usleep() and select().  But why futz
with anole?  Let's just check in the fix.  It'll either fix anole or
not, but we should fix the bug you found either way.
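
I.e. roughly this shape (standalone sketch with made-up names, not the
actual postmaster code): keep signals blocked for all the bookkeeping
and only open the mask for the sleep itself.

#include <signal.h>
#include <sys/select.h>

static void
server_loop_sketch(int listen_fd)
{
    sigset_t    blocked, unblocked;

    sigfillset(&blocked);
    sigemptyset(&unblocked);
    sigprocmask(SIG_SETMASK, &blocked, NULL);

    for (;;)
    {
        fd_set          rmask;
        struct timeval  timeout = {60, 0};  /* DetermineSleepTime() would compute this */

        FD_ZERO(&rmask);
        FD_SET(listen_fd, &rmask);

        sigprocmask(SIG_SETMASK, &unblocked, NULL); /* PG_SETMASK(&UnBlockSig) */
        (void) select(listen_fd + 1, &rmask, NULL, NULL, &timeout);
        sigprocmask(SIG_SETMASK, &blocked, NULL);   /* PG_SETMASK(&BlockSig) */

        /* accept connections, start pending bgworkers, etc. -- signals blocked */
    }
}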

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 29, 2014 at 3:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Ah. DetermineSleepTime(), which is called while signals are unblocked!,
>> modifies BackgroundWorkerList. Previously that only iterated the list,
>> without modifying it. That's already of quite debatable safety, but
>> modifying it without having blocked signals is most definitely
>> broken. The modification was introduced by 7f7485a0c...

> Ouch.  OK, yeah, that's a bug.

Yeah.  Can we just postpone the signal unblock till after that function?
        regards, tom lane



On 2014-09-29 16:16:24 -0400, Robert Haas wrote:
> 
> > If you can manually run stuff on that machine, it'd be rather helpful if
> > you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
> > HaveCrashedWorker() loop.
> 
> I'd do it the other way around, and adjust ServerLoop to put the
> PG_SETMASK calls right around pg_usleep() and select().

Sounds good.

> But why futz with anole?

Shorter feedback cycles. Anole doesn't seem to run very often, and you
have to log in to see whether it's just slow or hanging...

> Let's just check in the fix.  It'll either fix anole or not, but we
> should fix the bug you found either way.

Right. Are you going to do it? I can, but it'll be tomorrow. I'm neck
deep in another bug right now.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Mon, Sep 29, 2014 at 4:20 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Let's just check in the fix.  It'll either fix anole or not, but we
>> should fix the bug you found either way.
>
> Right. Are you going to do it? I can, but it'll be tomorrow. I'm neck
> deep in another bug right now.

I probably can't do it until Wednesday, but I'll do it then if you
can't get to it first.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Andres Freund wrote:

> I'm generally baffled at all the stuff postmaster does in signal
> handlers... ProcessConfigFile(), load_hba() et al. It's all done with
> signals disabled, but still.

As far as I recall, the rationale for why this is acceptable is that the
whole of postmaster is run with signals blocked; they are only unblocked
during the sleeping select().

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 2014-09-29 18:44:34 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > I'm generally baffled at all the stuff postmaster does in signal
> > handlers... ProcessConfigFile(), load_hba() et al. It's all done with
> > signals disabled, but still.
> 
> As far as I recall, the rationale for why this is acceptable is that the
> whole of postmaster is run with signals blocked; they are only unblocked
> during the sleeping select().

Yea, I wrote that above :). Still seems remarkably fragile and
unnecessarily complex. The whole thing would be much simpler and,
importantly, easier to understand if everything were done inside the
main loop and the handlers just set a latch...
But I guess that'd be a bit of a large change to something as central as
the postmaster's code..
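
Something along these lines, I mean (illustration only, with made-up
names, nowhere near a real patch): the handlers merely record what
happened, and the main loop does the actual work.

#include <signal.h>

static volatile sig_atomic_t got_sighup = 0;
static volatile sig_atomic_t got_sigusr1 = 0;

static void
flag_setting_handler(int signo)
{
    if (signo == SIGHUP)
        got_sighup = 1;
    else if (signo == SIGUSR1)
        got_sigusr1 = 1;
    /* in the real thing we'd also SetLatch() here to wake the loop up */
}

static void
postmaster_mainloop_sketch(void)
{
    for (;;)
    {
        /* wait on a latch plus the listen sockets here */

        if (got_sighup)
        {
            got_sighup = 0;
            /* reload the config files, pg_hba.conf, ... */
        }
        if (got_sigusr1)
        {
            got_sigusr1 = 0;
            /* start pending background workers, handle pmsignals, ... */
        }
    }
}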

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-09-29 18:44:34 -0300, Alvaro Herrera wrote:
>> As far as I recall, the rationale for why this is acceptable is that the
>> whole of postmaster is run with signals blocked; they are only unblocked
>> during the sleeping select().

> Yea, I wrote that above :). Still seems remarkably fragile and
> unnecessarily complex. The whole thing would be much simpler and
> importantly easier to understand if everything would be done inside the
> mainloop and the handlers just would set a latch...

Actually, I rather doubt that it would be either simpler or easier to
understand.  The reason to think about changing it, IMO, is the fear that
sooner or later we're going to file a bug against some platform's libc and
they're going to tell us to get lost because POSIX says that such-and-such
a library call isn't supported inside a signal handler.

> But I guess that'd be a bit of large change to something as central as
> postmaster's code..

Yeah.  It's a bit scary to consider changing this just to head off a
hypothetical portability problem, especially of a class that we've not
*actually* tripped across in nigh twenty years.  The closest thing that
I've seen to that is the valgrind bug we hit a while back,
https://bugzilla.redhat.com/show_bug.cgi?id=1024162
        regards, tom lane



On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> This happened again, and I investigated further.

Uh. Interestingly anole just succeeded twice:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE

I plan to commit the mask/unmask patch regardless, but it's curious. The
first of the two builds could have been you 'unsticking' it by manually
mucking around. Did you also do that for the second build?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
>> This happened again, and I investigated further.
>
> Uh. Interestingly anole just succeeded twice:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
>
> I plan to commit the mask/unmask patch regardless, but it's curious. The
> first of the two builds could have been you 'unsticking' it by manually
> mucking around. Did you also do that for the second build?

No, but I think the failures have always been intermittent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2014-10-01 10:45:13 -0400, Robert Haas wrote:
> On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
> >> This happened again, and I investigated further.
> >
> > Uh. Interestingly anole just succeeded twice:
> > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
> >
> > I plan to commit the mask/unmask patch regardless, but it's curious. The
> > first of the two builds could have been you 'unsticking' it by manually
> > mucking around. Did you also do that for the second build?
> 
> No, but I think the failures have always been intermittent.

There's no record of any relevantly failing builds on 9.4:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
and none from master either:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

Is it set up for master now? Because it has reported back for 9.4 twice,
but never for master.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Wed, Oct 1, 2014 at 10:50 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-01 10:45:13 -0400, Robert Haas wrote:
>> On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2014-09-29 14:46:20 -0400, Robert Haas wrote:
>> >> This happened again, and I investigated further.
>> >
>> > Uh. Interestingly anole just succeeded twice:
>> > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
>> >
>> > I plan to commit the mask/unmask patch regardless, but it's curious. The
>> > first of the two builds could have been you 'unsticking' it by manually
>> > mucking around. Did you also do that for the second build?
>>
>> No, but I think the failures have always been intermittent.
>
> There's no record of any relevantly failing builds on 9.4:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE
> and none from master either:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD
>
> Is it setup for master now? Because it has reported back for 9.4 twice,
> but never for master.

As far as I can tell, it's configured to run everything.  I just
checked, though, and found it wedged again.  I'm not sure whether it
was the same problem, though; I ended up just killing all of the
postgres processes to fix it.  We may be just at the beginning of an
exciting debugging journey.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Oct 1, 2014 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> As far as I can tell, it's configured to run everything.  I just
> checked, though, and found it wedged again.  I'm not sure whether it
> was the same problem, though; I ended up just killing all of the
> postgres processes to fix it.  We may be just at the beginning of an
> exciting debugging journey.

After much more time spent on this, and some help from two people
whose names are both six letters starting with "Andre", I discovered
the following:

- anole apparently had not tried to build master in a long time - I
think since May - because of a leftover "inst" directory inside
HEAD's buildroot.
- anole still refused to build either 9.4 or master until I removed a
couple of leftover socket directories in /tmp.
- after that it hung again trying to run the test_shm_mq regression
tests, in pretty much the same way as before.

The postmaster state looks like this:

0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) bt
#0  0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1
#1  0xc00000000045f950:0 in select+0x90 () from /usr/lib/hpux64/libc.so.1
#2  0x4000000000a4e510:0 in ServerLoop () at postmaster.c:1540
#3  0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178)   at postmaster.c:1223
#4  0x40000000008d0a90:0 in main (argc=3, argv=0x87ffffffffffd178)   at main.c:227
(gdb) p BackgroundWorkerList
$1 = {head = {next = 0x60000000000e8a58}}
(gdb) p BackgroundWorkerList.head
$2 = {next = 0x60000000000e8a58}
(gdb) p ((char *) BackgroundWorkerList.head) - 264
$3 = 0x60000000000e8950 "test_shm_mq"
(gdb) p (RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264)
$4 = (struct RegisteredBgWorker *) 0x60000000000e8950
(gdb) p ((RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264))[0]
$5 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?\000 \000\00511985", '\000'
<repeats 13 times>, "\005T\000\000\000\000\000\001?\000
\000\00512038\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\005t\000\000\000\000\000\001?\000
\000\00511917", '\000' <repeats 13 times>,
"\005?\000\000\000\000\000\001?\000 \000\0051209",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\000
\000\00511903", '\000' <repeats 13 times>,
"\005?\000\000\000\000\000\001?\000 \000\0051199", bgw_main_arg =
1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 12369, rw_child_slot = 35, rw_crashed_at = 0,
rw_shmem_slot = 6, rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e8938}}
 
(gdb) p $5.rw_lnode
$6 = {next = 0x60000000000e8938}
(gdb) p ((RegisteredBgWorker *) (((char *) $5.rw_lnode) - 264))[0]
$7 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?x\000 \000\00511931", '\000'
<repeats 13 times>, "\004?\000\000\000\000\000\001?}\000
\000\00511939\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\004?\000\000\000\000\000\001?\000
\000\00511957", '\000' <repeats 13 times>,
"\004?\000\000\000\000\000\001?\000 \000\0051209",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\000
\000\00511956", '\000' <repeats 13 times>,
"\005\024\000\000\000\000\000\001?\000 \000\0051192", bgw_main_arg =
1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0,
rw_shmem_slot = 5, rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e8818}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $7.rw_lnode) - 264))[0]
$8 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?f\000 \000\00512051", '\000'
<repeats 13 times>, "\0044\000\000\000\000\000\001?g\000
\000\00512047\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\004T\000\000\000\000\000\001?j\000
\000\00511979", '\000' <repeats 13 times>,
"\004d\000\000\000\000\000\001?k\000 \000\0051197",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?m\000 \000\00511873", '\000'
<repeats 13 times>, "\004?\000\000\000\000\000\001?u\000
\000\0051196", bgw_main_arg = 1069289198, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot = 4, rw_terminate = 0 '\000',
rw_lnode = {next = 0x60000000000e86f8}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $8.rw_lnode) - 264))[0]
$9 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?\\\000 \000\00512085", '\000'
<repeats 13 times>, "\003?\000\000\000\000\000\001?]\000
\000\00512076\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\003?\000\000\000\000\000\001?_\000
\000\00511885", '\000' <repeats 13 times>,
"\003?\000\000\000\000\000\001?`\000 \000\0051206",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?a\000 \000\00512065", '\000'
<repeats 13 times>, "\003?\000\000\000\000\000\001?c\000
\000\0051188", bgw_main_arg = 695156898, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665468967706, rw_shmem_slot = 3, rw_terminate = 0
'\000', rw_lnode = {next = 0x60000000000e85d8}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $9.rw_lnode) - 264))[0]
$10 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?\f\000 \000\00512045", '\000'
<repeats 13 times>, "\003\024\000\000\000\000\000\001?\016\000
\000\00512049\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\0034\000\000\000\000\000\001?=\000
\000\00512057", '\000' <repeats 13 times>,
"\003D\000\000\000\000\000\001?A\000 \000\0051208",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?B\000 \000\00512092", '\000'
<repeats 13 times>, "\003d\000\000\000\000\000\001?C\000
\000\0051192", bgw_main_arg = 507386864, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665463586145, rw_shmem_slot = 2, rw_terminate = 0
'\000', rw_lnode = {next = 0x60000000000e84b8}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $10.rw_lnode) - 264))[0]
$11 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001??\000 \000\00511937", '\000'
<repeats 13 times>, "\002?\000\000\000\000\000\001??\000
\000\00511941\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\002?\000\000\000\000\000\001?\000\000
\000\00511950", '\000' <repeats 13 times>,
"\002?\000\000\000\000\000\001?\002\000 \000\0051195",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\a\000 \000\00511960",
'\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\b\000
\000\0051198", bgw_main_arg = 1747912200, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462962221, rw_shmem_slot = 1, rw_terminate = 0
'\000', rw_lnode = {next = 0x60000000000e8398}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $11.rw_lnode) - 264))[0]
$12 = {rw_worker = {   bgw_name = "test_shm_mq", '\000' <repeats 11 times>, "
\000\000\000\000\000\000\000 ", '\000' <repeats 23 times>,
"\021?\000\000\000\000\000\001??",   bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState,   bgw_restart_time
=-1, bgw_main = 0,   bgw_library_name =
 
"test_shm_mq\000\000\001??\000(\000\t12115_fsm", '\000' <repeats 17
times>, "\022(\000\000\000\000\000\001??\0000\000\020pg_i",   bgw_function_name = "test_shm_mq_main", '\000' <repeats
14times>,
 
"\0228\000\000\000\000\000\001?\003\000 \000\00552236", '\000'
<repeats 13 times>, "\022H", bgw_main_arg = 502019512, bgw_notify_pid
= 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462860980, rw_shmem_slot = 0,
rw_terminate = 0 '\000', rw_lnode = {next = 0x0}}
 
(gdb) fr 3
#3  0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178)   at postmaster.c:1223
1223            status = ServerLoop();
(gdb) p StartWorkerNeeded
$13 = 1 '\001'

So apparently commit a39e78b710eb588e102aedd2828611d7bc74714b, a good
idea though it certainly was, wasn't the root cause of this issue -
because the symptoms now seem identical to what I saw debugging
earlier hangs, but I verified that those changes are present here.

Further debugging reveals that sigusr1_handler() gets called
repeatedly, to start autovacuum workers, and it keeps waking up and
starting them.  But that doesn't cause the background workers to get
started either, because although sigusr1_handler() contains a call to
maybe_start_bgworker, it only does that if start_bgworker = true,
which only happens if
CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
which on these calls it isn't.

And I think this might also be the missing ingredient answering
Andres's question from before: why doesn't the 60-second
select() timeout cause the background workers to eventually start even
if the select() doesn't get interrupted?  There seems to be a SIGUSR1
arriving about every 3 seconds, and I bet that's resetting the timer
every time.
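
To spell that out, the tail end of sigusr1_handler() is shaped roughly
like this (paraphrased, not a verbatim quote of postmaster.c):

    bool        start_bgworker = false;

    if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
    {
        BackgroundWorkerStateChange();
        start_bgworker = true;
    }

    /* ... autovacuum-worker launches and other PMSIGNAL processing happen
     * here, and none of it sets start_bgworker, even when StartWorkerNeeded
     * is true ... */

    if (start_bgworker)
        maybe_start_bgworker();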

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Further debugging reveals that sigusr1_handler() gets called
> repeatedly, to start autovacuum workers, and it keeps waking up and
> starting them.  But that doesn't cause the background workers to get
> started either, because although sigusr1_handler() contains a call to
> maybe_start_bgworker, it only does that if start_bgworker = true,
> which only happens if
> CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> which on these calls it isn't.
> And I think this might also be the missing ingredient answering
> Andres's question from before: why doesn't the 60-second
> select()-timeout cause the background worker to eventually start even
> if the SELECT doesn't get interrupted?  There seems to be a SIGUSR1
> arriving about every 3 seconds, and I bet that's resetting the timer
> every time.

For now I propose to address this by committing the attached patch,
which gets rid of the separate start_bgworker flag inside
sigusr1_handler() and instead uses the same StartWorkerNeeded ||
HaveCrashedWorker test that we use inside ServerLoop() to decide
whether to call maybe_start_bgworker().  Either more signals will
arrive (in which case the signal handler will launch an additional
background worker every time a signal arrives) or they won't (in which
case the 60-second timeout will eventually expire, and ServerLoop()
will kick into high gear and satisfy all outstanding requests).  This
isn't really right, because there might still be a quite noticeable
delay waiting for workers to get launched, but at least the delay
would be bounded to at most 60 seconds rather than, as at present,
potentially infinite.
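
Concretely, the change amounts to ending sigusr1_handler() with the same
test ServerLoop() uses, i.e. something like this sketch (see the attached
patch for the real thing):

    if (StartWorkerNeeded || HaveCrashedWorker)
        maybe_start_bgworker();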

A totally correct fix will require a bit more thought.  A search of
the git history reveals that the problem of a signal restarting the
timeout is not new: Tom fixed a similar problem back in 2007 by making
the autovacuum launcher sleep for at most a second at a time.  Such a
fix isn't ideal here, because we really don't want an up-to-1-second
delay launching a newly-registered background worker if there's a way
to avoid that -- it's probably OK for launching daemons, but it's not
so hot for parallel query.  However, we could:

(1) Use the self-pipe trick; a rough standalone sketch of that pattern
follows at the end of this list.  We couldn't directly use a latch, at
least not without a new API, because we might be waiting on more than
one socket.

(2) Have the postmaster not set SA_RESTART for the sigusr1 handler.  I
don't know how platform-independent this approach would be.

(3) Have sigusr1_handler repeatedly call maybe_start_bgworker() until
StartWorkerNeeded becomes false, instead of just calling it once.
ServerLoop() is carefully coded to call maybe_start_bgworker() just
once per iteration, presumably to make sure the server stays
responsive even if the bgworker-starting machinery is quite busy;
looping inside the signal handler would give up that nice property
unless we had some way to break out of the loop if there's activity on
the socket.
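
Here's the self-pipe sketch mentioned under (1) - a generic illustration
with made-up names, not a proposed patch:

#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

static int self_pipe[2];        /* [0] = read end, [1] = write end */

/* SIGUSR1 handler: just poke the pipe; write() is async-signal-safe. */
static void
wakeup_handler(int signo)
{
    char        c = 0;

    (void) write(self_pipe[1], &c, 1);
}

static void
init_self_pipe(void)
{
    (void) pipe(self_pipe);
    (void) fcntl(self_pipe[0], F_SETFL, O_NONBLOCK);
    (void) fcntl(self_pipe[1], F_SETFL, O_NONBLOCK);
}

/* The server loop waits on the listen socket AND the pipe's read end. */
static void
wait_for_work(int listen_fd)
{
    fd_set      rmask;
    struct timeval timeout = {60, 0};
    int         maxfd = (listen_fd > self_pipe[0]) ? listen_fd : self_pipe[0];
    char        buf[16];

    FD_ZERO(&rmask);
    FD_SET(listen_fd, &rmask);
    FD_SET(self_pipe[0], &rmask);

    if (select(maxfd + 1, &rmask, NULL, NULL, &timeout) > 0 &&
        FD_ISSET(self_pipe[0], &rmask))
    {
        /* drain the pipe, then act on StartWorkerNeeded and friends */
        while (read(self_pipe[0], buf, sizeof(buf)) > 0)
            ;
    }
}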

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

On 2014-10-03 14:38:10 -0400, Robert Haas wrote:
> On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Further debugging reveals that sigusr1_handler() gets called
> > repeatedly, to start autovacuum workers, and it keeps waking up and
> > starting them.  But that doesn't cause the background workers to get
> > started either, because although sigusr1_handler() contains a call to
> > maybe_start_bgworker, it only does that if start_bgworker = true,
> > which only happens if
> > CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> > which on these calls it isn't.
> > And I think this might also be the missing ingredient answering
> > Andres's question from before: why doesn't the 60-second
> > select()-timeout cause the background worker to eventually start even
> > if the SELECT doesn't get interrupted?  There seems to be a SIGUSR1
> > arriving about every 3 seconds, and I bet that's resetting the timer
> > every time.

Ick. There are a couple of reports about HP-UX behaving that way for
select(), so that seems like a reasonable theory. And a good explanation
of why only anole was seeing this.

> For now I propose to address this by committing the attached patch,
> which gets rid of the separate start_bgworker flag inside
> sigusr1_handler() and instead uses the same StartWorkerNeeded ||
> HaveCrashedWorker test that we use inside ServerLoop() to decide
> whether to call maybe_start_bgworker().

That seems sane.

> Either more signals will
> arrive (in which case the signal handler will launch an additional
> background worker every time a signal arrives) or they won't (in which
> case the 60-second timeout will eventually expire, and ServerLoop()
> will kick into high gear and satisfy all outstanding requests).  This
> isn't really right, because there might still be a quite noticeable
> delay waiting for workers to get launched, but at least the delay
> would be bounded to at most 60 seconds rather than, as at present,
> potentially infinite.

Right, it will somewhat fix the current issue. But aren't the
consequences of a potentially never/very seldom run ServerLoop() more
severe? There are a couple more things done in that loop :(

> A totally correct fix will require a bit more thought.  A search of
> the git history reveals that the problem of a signal restarting the
> timeout is not new: Tom fixed a similar problem back in 2007 by making
> the autovacuum launcher sleep for at most a second at a time.  Such a
> fix isn't ideal here, because we really don't want an up-to-1-second
> delay launching a newly-registered background worker if there's a way
> to avoid that -- it's probably OK for launching daemons, but it's not
> so hot for parallel query.  However, we could:
> 
> (1) Use the self-pipe trick. We could not directly use a latch, at
> least not without a new API, because we might be waiting on more than
> one socket.

I think this is the way to go. And I think we should make the latch API
support that case, instead of reinventing it. That'll also have the
advantage of automatically using poll() (which is more efficient),
rather than select(), when supported by the platform.

> (2) Have the postmaster not set SA_RESTART for the sigusr1 handler.  I
> don't know how platform-independent this approach would be.

This makes me wary. The postmaster doesn't do too much stuff, but it'd
still be annoying to have to make sure everything always loops around
EINTR.
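
(By "loop around EINTR" I mean the usual retry idiom; a generic
illustration with a made-up helper name:)

#include <errno.h>
#include <sys/select.h>

static int
select_retry_on_eintr(int nfds, fd_set *rfds, struct timeval *timeout)
{
    int         rc;

    do
        rc = select(nfds, rfds, NULL, NULL, timeout);
    while (rc < 0 && errno == EINTR);   /* interrupted by a signal handler: retry */

    return rc;
}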

Greetings,

Andres Freund



Hi,

I'm still seeing trouble with test_shm_mq on mipsel (9.4 rc1):

https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mipsel&ver=9.4~rc1-1&stamp=1416547779

mips had the problem as well in the past (9.4 beta3):

https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-3&stamp=1413607370
https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-1&stamp=1412893135

The mips beta3 failures eventually went away when the build was done
on a different machine. This was the first time the mipsel build was
done on this build machine, so it seems the problem might well be
caused by some subarchitecture difference.

mipsel:
bad: Broadcom BCM91250A aka SWARM (mayer.debian.org)
good: Lemote 3A ITX-A1101 (Quad Core Loongson 3A) (mipsel-manda-01.debian.org)

mips:
bad: Broadcom BCM91250A aka SWARM (ball.debian.org)
good: EdgeRouter Pro (mips-aql-02.debian.org)
good: 16 x Cavium Octeon V0.3 (lucatelli.debian.org)

https://buildd.debian.org/status/logs.php?pkg=postgresql-9.4&arch=mipsel
https://buildd.debian.org/status/logs.php?pkg=postgresql-9.4&arch=mips
(Not all failures listed there are due to shm_mq, just the newer ones.)

At the moment all I have are the above build logs showing that the
build simply hangs there, i.e. I haven't seen the problem in a context
yet where I could have pulled a backtrace or the like.

Anyone got an idea?

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



On Mon, Nov 24, 2014 at 9:36 AM, Christoph Berg <cb@df7cb.de> wrote:
> I'm still seeing trouble with test_shm_mq on mipsel (9.4 rc1):

Boy, that test has certainly caught its share of bugs, and not in the
places I would have expected.

The last round of wrestling with this had to do with working around
HP-UX behavior that differs from Linux.  So it seems like this is
likely to be an altogether different failure than what we saw on
anole.

> https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mipsel&ver=9.4~rc1-1&stamp=1416547779
>
> mips had the problem as well in the past (9.4 beta3):
>
> https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-3&stamp=1413607370
> https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-1&stamp=1412893135

For how long was it hung before you killed it?

> The mips beta3 failures eventually went away when the build was done
> on a different machine. This was the first time the mipsel build was
> done on this build machine, so it seems the problem might well be
> caused by some subarchitecture difference.

Does it fail every time when run on a machine where it fails sometimes?

It might not be related to the subarchitecture difference in any
particularly interesting way; it could just be a race condition that
is triggered, or not, depending on the precise timing of things, which
might vary based on subarchitecture, compiler, running kernel version,
etc.

> Anyone got an idea?

Not off-hand.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Robert Haas 2014-11-24 <CA+TgmoacnppMdgg4n14U2BjujNDNMOU8xxHhPMvO+0u92ckH+w@mail.gmail.com>
> > https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mipsel&ver=9.4~rc1-1&stamp=1416547779
> >
> > mips had the problem as well in the past (9.4 beta3):
> >
> > https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-3&stamp=1413607370
> > https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-1&stamp=1412893135
> 
> For how long was it hung before you killed it?

That was an automated build, killed "after 300 minutes of inactivity".
(I guess that means it didn't output anything on the terminal for that
long, but I've never looked closer into sbuild.)

> > The mips beta3 failures eventually went away when the build was done
> > on a different machine. This was the first time the mipsel build was
> > done on this build machine, so it seems the problem might well be
> > caused by some subarchitecture difference.
> 
> Does it fail every time when run on a machine where it fails sometimes?

So far there's a consistent host -> fail-or-not mapping, but most
mips/mipsel build hosts have seen only one attempt so far that
actually got as far as running the shm_mq test.

> It might not be related to the subarchitecture difference in any
> particularly interesting way; it could just be a race condition that
> is triggered, or not, depending on the precise timing of things, which
> might vary based on subarchitecture, compiler, running kernel version,
> etc.

Compiler and kernel should mostly be the same on all hosts (gcc from
Debian unstable inside the build chroots and kernel from stable on the
host system). The common bit on the failing hosts is that both the mips
and mipsel hosts are registered as "Broadcom BCM91250A aka SWARM",
which hints at a subarch problem. But of course that could just be a
coincidence.

Atm I don't have access to the boxes where it was failing (the builds
succeed on the mips(el) porter hosts available to Debian developers).
I'll see if I can arrange access there and run a test.

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: test_shm_mq failing on mips*

Re: To Robert Haas 2014-11-24 <20141124200824.GA22662@msg.df7cb.de>
> > Does it fail every time when run on a machine where it fails sometimes?
> 
> So far there's a consistent host -> fail-or-not mapping, but most
> mips/mipsel build hosts have seen only one attempt so far which
> actually came so far to actually run the shm_mq test.

I got the build rescheduled on the same machine and it's hanging
again.

> Atm I don't have access to the boxes where it was failing (the builds
> succeed on the mips(el) porter hosts available to Debian developers).
> I'll see if I can arrange access there and run a test.

Julien Cristau was kind enough to poke into the hanging processes. The
build has been stuck now for about 4h, while that postgres backend has
only consumed 4s of CPU time (according to plain "ps"). The currently
executing query is:

SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
 

(Waiting f, active, backend_start 6s older than xact_start, xact_start
same as query_start, state_change 19µs newer, all 4h old)

Backtrace:

Reading symbols from /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/contrib/test_shm_mq/tmp_check/install/usr/lib/postgresql/9.4/lib/test_shm_mq.so...done.
Loaded symbols for /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/contrib/test_shm_mq/tmp_check/install/usr/lib/postgresql/9.4/lib/test_shm_mq.so
0x76ea1d98 in poll () from /lib/mipsel-linux-gnu/libc.so.6
(gdb) where
#0  0x76ea1d98 in poll () from /lib/mipsel-linux-gnu/libc.so.6
#1  0x00683044 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fe2d084) at /usr/include/mipsel-linux-gnu/bits/poll2.h:41
#2  WaitLatchOrSocket (latch=0x766cae7c, wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:325
#3  0x6dc91968 in test_shm_mq_pipelined (fcinfo=<optimized out>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../contrib/test_shm_mq/test.c:231
#4  0x005d448c in ExecMakeFunctionResultNoSets (fcache=0xda4658, econtext=0xda4590, isNull=0xda4c80 "hL\332", isDone=<optimized out>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execQual.c:2023
#5  0x005db0fc in ExecTargetList (isDone=0x7fe2d210, itemIsDone=0xda4d18, isnull=0xda4c80 "hL\332", values=0xda4c70, econtext=0xda4590, targetlist=0xda4d00) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execQual.c:5304
#6  ExecProject (projInfo=<optimized out>, isDone=0x7fe2d210) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execQual.c:5519
#7  0x005f08b8 in ExecResult (node=0xda4508) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/nodeResult.c:155
#8  0x005d308c in ExecProcNode (node=0xda4508) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execProcnode.c:373
#9  0x005cfb5c in ExecutePlan (dest=0xd9e0a0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, planstate=0xda4508, estate=0xda1640) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execMain.c:1475
#10 standard_ExecutorRun (queryDesc=0xd18240, direction=<optimized out>, count=0) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execMain.c:308
#11 0x00707bc0 in PortalRunSelect (portal=portal@entry=0xd85518, forward=forward@entry=1 '\001', count=0, count@entry=2147483647, dest=dest@entry=0xd9e0a0) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/pquery.c:946
#12 0x0070959c in PortalRun (portal=0xd85518, count=2147483647, isTopLevel=<optimized out>, dest=0xd9e0a0, altdest=0xd9e0a0, completionTag=0x7fe2d6bc "") at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/pquery.c:790
#13 0x007067b0 in exec_simple_query (query_string=0xd594d0 "SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);") at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/postgres.c:1045
#14 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0xcfa180 "contrib_regression", username=<optimized out>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/postgres.c:4016
#15 0x00450888 in BackendRun (port=0xd16288) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:4123
#16 BackendStartup (port=0xd16288) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:3797
#17 ServerLoop () at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:1576
#18 0x006983b8 in PostmasterMain (argc=8, argv=<optimized out>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:1223
#19 0x0045256c in main (argc=8, argv=0xcf9890) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/main/main.c:227

Kernel stack:

sudo cat /proc/29001/stack
[<ffffffff805cb7c0>] __schedule+0x300/0x9c0
[<ffffffff805cd598>] schedule_hrtimeout_range_clock+0xd8/0x100
[<ffffffff80238ca4>] poll_schedule_timeout+0x34/0x60
[<ffffffff8023a168>] do_sys_poll+0x320/0x470
[<ffffffff8023a408>] SyS_poll+0xe0/0x168
[<ffffffff801142b4>] handle_sys+0x114/0x134

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



On Tue, Nov 25, 2014 at 10:42 AM, Christoph Berg <cb@df7cb.de> wrote:
> Re: To Robert Haas 2014-11-24 <20141124200824.GA22662@msg.df7cb.de>
>> > Does it fail every time when run on a machine where it fails sometimes?
>>
>> So far there's a consistent host -> fail-or-not mapping, but most
>> mips/mipsel build hosts have seen only one attempt so far which
>> actually came so far to actually run the shm_mq test.
>
> I got the build rescheduled on the same machine and it's hanging
> again.
>
>> Atm I don't have access to the boxes where it was failing (the builds
>> succeed on the mips(el) porter hosts available to Debian developers).
>> I'll see if I can arrange access there and run a test.
>
> Julien Cristau was so kind to poke into the hanging processes. The
> build has been stuck now for about 4h, while that postgres backend has
> only consumed 4s of CPU time (according to plain "ps"). The currently
> executing query is:
>
> SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>
> (Waiting f, active, backend_start 6s older than xact_start, xact_start
> same as query_start, state_change 19µs newer, all 4h old)

I can't tell from this exactly what's going wrong.  Questions:

1. Are there any background worker processes running at the same time?
If so, how many?  We'd expect to see 3.
2. Can we get a printout of the following variables in stack frame 3
(test_shm_mq_pipelined)?  send_count, loop_count, *outqh, *inqh,
inqh->mqh_queue[0], outqh->mqh_queue[0]
3. What does a backtrace of each background worker process look like?
If they are stuck inside copy_messages(), can you print *outqh, *inqh,
inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?

Sorry for the hassle; I just don't have a better idea how to debug this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Robert Haas 2014-12-02 <CA+Tgmoa9HE=1GObU7MKuGLDBjBNSNRW0bDE4P3JA=P=9mqGgqw@mail.gmail.com>
> I can't tell from this exactly what's going wrong.  Questions:
> 
> 1. Are there any background worker processes running at the same time?
>  If so, how many?  We'd expect to see 3.
> 2. Can we printout of the following variables in stack frame 3
> (test_shm_mq_pipelined)?  send_count, loop_count, *outqh, *inqh,
> inqh->mqh_queue[0], outqh->mqh_queue[0]
> 3. What does a backtrace of each background worker process look like?
> If they are stuck inside copy_messages(), can you print *outqh, *inqh,
> inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?
> 
> Sorry for the hassle; I just don't have a better idea how to debug this.

No problem, I'm aware this isn't an easy target.

I don't have access to the build env myself, and I'm not sure if I can
find someone to retry the test there. I'll let you know when I have
more data, but that will take a while (if at all :-/).

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/