Thread: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote: > Hamid@EDB; Can you please have someone configure anole to build git > head as well as the other branches? Thanks. The test_shm_mq regression tests hung on this machine this morning. Hamid was able to give me access to log in and troubleshoot. Unfortunately, I wasn't able to completely track down the problem before accidentally killing off the running cluster, but it looks like test_shm_mq_pipelined() tried to start 3 background workers and the postmaster only ever launched one of them, so the test just sat there and waited for the other two workers to start. At this point, I have no idea what could cause the postmaster to be asleep at the switch like this, but it seems clear that's what happened. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > The test_shm_mq regression tests hung on this machine this morning. It looks like hamster may have a repeatable issue there as well, since the last set of DSM code changes. regards, tom lane
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Tom Lane
Date:
I wrote: > It looks like hamster may have a repeatable issue there as well, > since the last set of DSM code changes. Ah, scratch that --- on closer inspection it looks like both failures probably trace to out-of-disk-space. regards, tom lane
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Michael Paquier
Date:
On Sat, May 10, 2014 at 6:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> The test_shm_mq regression tests hung on this machine this morning.
>
> It looks like hamster may have a repeatable issue there as well,
> since the last set of DSM code changes.

Yeah, this node has a limited amount of space available, as it runs on a 4GB flash card... I just freed up ~200MB of cached packages on it; let's hope the out-of-space errors become less frequent when building a branch. What is interesting, btw, is that it only happens for a couple of contrib tests (pgcrypto, test_shm_mq), and only on the master branch.
--
Michael
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote:
>> Hamid@EDB; Can you please have someone configure anole to build git
>> head as well as the other branches? Thanks.
>
> The test_shm_mq regression tests hung on this machine this morning.
> Hamid was able to give me access to log in and troubleshoot.
> Unfortunately, I wasn't able to completely track down the problem
> before accidentally killing off the running cluster, but it looks like
> test_shm_mq_pipelined() tried to start 3 background workers and the
> postmaster only ever launched one of them, so the test just sat there
> and waited for the other two workers to start. At this point, I have
> no idea what could cause the postmaster to be asleep at the switch
> like this, but it seems clear that's what happened.

This happened again, and I investigated further. It looks like the postmaster knows full well that it's supposed to start more bgworkers: the ones that never get started are in the postmaster's BackgroundWorkerList, and StartWorkerNeeded is true. But it only starts the first one, not all three. Why?

Here's my theory. When I did a backtrace inside the postmaster, it was stuck inside select(), within ServerLoop(). I think that's just where it was when the backend that wanted to run test_shm_mq requested that a few background workers get launched. Each registration would have sent the postmaster a separate SIGUSR1, but for some reason the postmaster only received one, which I think is legit behavior, though possibly not typical on modern Linux systems. When the SIGUSR1 arrived, the postmaster jumped into sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(), which launched the first background worker. Then it returned, and the arrival of the signal did NOT interrupt the pending select().

This chain of events can't occur if an arriving SIGUSR1 causes select() to return EINTR or EWOULDBLOCK, nor can it happen if the signal handler is entered three separate times, once for each SIGUSR1. That combination of explanations seems likely sufficient to explain why this doesn't occur on other machines.

The code seems to have been this way since the commit that introduced background workers (da07a1e856511dca59cbb1357616e26baa64428e), although the function was called StartOneBackgroundWorker back then.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
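The "only received one" part of this theory is easy to reproduce outside PostgreSQL: ordinary POSIX signals are not queued, so several SIGUSR1s that arrive while the signal is blocked (or while the handler is already running) collapse into a single delivery. Here is a minimal standalone C sketch, purely illustrative and not postmaster code:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t usr1_count = 0;

static void
handle_usr1(int signo)
{
	usr1_count++;		/* one increment per actual delivery */
}

int
main(void)
{
	sigset_t	block;
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handle_usr1;
	sigaction(SIGUSR1, &sa, NULL);

	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	sigprocmask(SIG_BLOCK, &block, NULL);

	/* three "worker registrations", i.e. three separate SIGUSR1s */
	kill(getpid(), SIGUSR1);
	kill(getpid(), SIGUSR1);
	kill(getpid(), SIGUSR1);

	/* unblocking delivers whatever is pending -- as a single signal */
	sigprocmask(SIG_UNBLOCK, &block, NULL);

	printf("handler ran %d time(s) for 3 signals sent\n", (int) usr1_count);
	return 0;
}

On most systems this prints that the handler ran once for three signals, which is exactly the situation the postmaster has to cope with if it expects one handler invocation per worker registration.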
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-09-29 14:46:20 -0400, Robert Haas wrote: > On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote: > >> Hamid@EDB; Can you please have someone configure anole to build git > >> head as well as the other branches? Thanks. > > > > The test_shm_mq regression tests hung on this machine this morning. > > Hamid was able to give me access to log in and troubleshoot. > > Unfortunately, I wasn't able to completely track down the problem > > before accidentally killing off the running cluster, but it looks like > > test_shm_mq_pipelined() tried to start 3 background workers and the > > postmaster only ever launched one of them, so the test just sat there > > and waited for the other two workers to start. At this point, I have > > no idea what could cause the postmaster to be asleep at the switch > > like this, but it seems clear that's what happened. > > This happened again, and I investigated further. It looks like the > postmaster knows full well that it's supposed to start more bgworkers: > the ones that never get started are in the postmaster's > BackgroundWorkerList, and StartWorkerNeeded is true. But it only > starts the first one, not all three. Why? > > Here's my theory. When I did a backtrace inside the postmaster, it > was stuck inside inside select(), within ServerLoop(). I think that's > just where it was when the backend that wanted to run test_shm_mq > requested that a few background workers get launched. Each > registration would have sent the postmaster a separate SIGUSR1, but > for some reason the postmaster only received one, which I think is > legit behavior, though possibly not typical on modern Linux systems. > When the SIGUSR1 arrived, the postmaster jumped into > sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(), > which launched the first background worker. Then it returned, and the > arrival of the signal did NOT interrupt the pending select(). > > This chain of events can't occur if an arriving SIGUSR1 causes > select() to return EINTR or EWOULDBLOCK, nor can it happen if the > signal handler is entered three separate times, once for each SIGUSR1. > That combination of explanations seems likely sufficient to explain > why this doesn't occur on other machines. > > The code seems to have been this way since the commit that introduced > background workers (da07a1e856511dca59cbb1357616e26baa64428e), > although the function was called StartOneBackgroundWorker back then. If that theory is true, wouldn't things get unstuck everytime a new connection comes in? Or 60 seconds have passed? That's not to say this isn't wrong, but still? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> This happened again, and I investigated further. It looks like the
>> postmaster knows full well that it's supposed to start more bgworkers:
>> the ones that never get started are in the postmaster's
>> BackgroundWorkerList, and StartWorkerNeeded is true. But it only
>> starts the first one, not all three. Why?
>>
>> Here's my theory. When I did a backtrace inside the postmaster, it
>> was stuck inside select(), within ServerLoop(). I think that's
>> just where it was when the backend that wanted to run test_shm_mq
>> requested that a few background workers get launched. Each
>> registration would have sent the postmaster a separate SIGUSR1, but
>> for some reason the postmaster only received one, which I think is
>> legit behavior, though possibly not typical on modern Linux systems.
>> When the SIGUSR1 arrived, the postmaster jumped into
>> sigusr1_handler(). sigusr1_handler() calls maybe_start_bgworker(),
>> which launched the first background worker. Then it returned, and the
>> arrival of the signal did NOT interrupt the pending select().
>>
>> This chain of events can't occur if an arriving SIGUSR1 causes
>> select() to return EINTR or EWOULDBLOCK, nor can it happen if the
>> signal handler is entered three separate times, once for each SIGUSR1.
>> That combination of explanations seems likely sufficient to explain
>> why this doesn't occur on other machines.
>>
>> The code seems to have been this way since the commit that introduced
>> background workers (da07a1e856511dca59cbb1357616e26baa64428e),
>> although the function was called StartOneBackgroundWorker back then.
>
> If that theory is true, wouldn't things get unstuck everytime a new
> connection comes in? Or 60 seconds have passed? That's not to say this
> isn't wrong, but still?

There aren't going to be any new connections arriving when running the contrib regression tests, I believe, so I don't think there is an escape hatch there. I didn't think to check how the timeout was set in ServerLoop(), and it does look like the maximum ought to be 60 seconds, so either there's some other ingredient I'm missing here, or the whole theory is just wrong altogether. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-09-29 15:24:55 -0400, Robert Haas wrote:
> On Mon, Sep 29, 2014 at 2:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > If that theory is true, wouldn't things get unstuck everytime a new
> > connection comes in? Or 60 seconds have passed? That's not to say this
> > isn't wrong, but still?
>
> There aren't going to be any new connections arriving when running
> the contrib regression tests, I believe, so I don't think there is an
> escape hatch there.

I thought you might have tried connecting... and I'd have guessed you'd have reported it if that had fixed things.

> I didn't think to check how the timeout was set in
> ServerLoop(), and it does look like the maximum ought to be 60 seconds,
> so either there's some other ingredient I'm missing here, or the whole
> theory is just wrong altogether. :-(

Yea :(. Note how signals are blocked in all the signal handlers and only unblocked for a very short time (the sleep).

(stare at random shit for far too long)

Ah. DetermineSleepTime(), which is called while signals are unblocked(!), modifies BackgroundWorkerList. Previously that code only iterated the list, without modifying it. That's already of quite debatable safety, but modifying it without having blocked signals is most definitely broken. The modification was introduced by 7f7485a0c...

If you can manually run stuff on that machine, it'd be rather helpful if you could put a PG_SETMASK(&BlockSig); ... PG_SETMASK(&UnBlockSig); in the HaveCrashedWorker() loop.

Greetings,

Andres Freund
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-09-29 14:46:20 -0400, Robert Haas wrote: > On Fri, May 9, 2014 at 10:18 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Sat, May 3, 2014 at 4:31 AM, Dave Page <dave.page@enterprisedb.com> wrote: > >> Hamid@EDB; Can you please have someone configure anole to build git > >> head as well as the other branches? Thanks. > > > > The test_shm_mq regression tests hung on this machine this morning. > > Hamid was able to give me access to log in and troubleshoot. > > Unfortunately, I wasn't able to completely track down the problem > > before accidentally killing off the running cluster, but it looks like > > test_shm_mq_pipelined() tried to start 3 background workers and the > > postmaster only ever launched one of them, so the test just sat there > > and waited for the other two workers to start. At this point, I have > > no idea what could cause the postmaster to be asleep at the switch > > like this, but it seems clear that's what happened. > > This happened again, and I investigated further. It looks like the > postmaster knows full well that it's supposed to start more bgworkers: > the ones that never get started are in the postmaster's > BackgroundWorkerList, and StartWorkerNeeded is true. But it only > starts the first one, not all three. Why? Not necessarily related, but one interesting tidbit is that fork isn't mentioned to be async signal safe on HP-UX: http://nixdoc.net/man-pages/HP-UX/man5/thread_safety.5.html#Async%20Signal%20Safe I have some doubts that fork() really could be not signal safe, but it's a bit odd. IIRC posix requires fork() to be async safe, at least if threads aren't present. I'm generally baffled at all the stuff postmaster does in signal handlers... ProcessConfigFile(), load_hba() et al. It's all done with signals disabled, but still. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Mon, Sep 29, 2014 at 3:37 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Yea :(. Note how signals are blocked in all the signal handlers and only > unblocked for a very short time (the sleep). > > (stare at random shit for far too long) > > Ah. DetermineSleepTime(), which is called while signals are unblocked!, > modifies BackgroundWorkerList. Previously that only iterated the list, > without modifying it. That's already of quite debatable safety, but > modifying it without having blocked signals is most definitely > broken. The modification was introduced by 7f7485a0c... Ouch. OK, yeah, that's a bug. > If you can manually run stuff on that machine, it'd be rather helpful if > you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the > HaveCrashedWorker() loop. I'd do it the other way around, and adjust ServerLoop to put the PG_SETMASK calls right around pg_usleep() and select(). But why futz with anole? Let's just check in the fix. It'll either fix anole or not, but we should fix the bug you found either way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
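For readers following along, here is a minimal, self-contained sketch (illustrative only, not the actual ServerLoop() code) of the shape being proposed: run with SIGUSR1 blocked, and unblock it only around the wait itself, so the handler can never run while the main loop is manipulating shared bookkeeping such as the worker list.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t wakeups = 0;

static void
handle_usr1(int signo)
{
	wakeups++;
}

int
main(void)
{
	sigset_t	blockset;
	struct sigaction sa;
	int			i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handle_usr1;
	sigaction(SIGUSR1, &sa, NULL);

	sigemptyset(&blockset);
	sigaddset(&blockset, SIGUSR1);
	sigprocmask(SIG_BLOCK, &blockset, NULL);	/* like PG_SETMASK(&BlockSig) */

	printf("pid %d: send SIGUSR1 from another shell to see wakeups\n", (int) getpid());

	for (i = 0; i < 5; i++)
	{
		struct timeval tv = {1, 0};		/* pretend DetermineSleepTime() said 1s */

		/*
		 * Shared bookkeeping (walking or modifying the worker list) would
		 * happen here, safely: the handler cannot run while SIGUSR1 is
		 * blocked.
		 */

		sigprocmask(SIG_UNBLOCK, &blockset, NULL);	/* PG_SETMASK(&UnBlockSig) */
		select(0, NULL, NULL, NULL, &tv);			/* the only interruptible wait */
		sigprocmask(SIG_BLOCK, &blockset, NULL);	/* PG_SETMASK(&BlockSig) */

		printf("iteration %d done, %d wakeup(s) so far\n", i, (int) wakeups);
	}
	return 0;
}

Sending the printed pid a SIGUSR1 while the program sleeps shows the handler running only inside the unblocked window.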
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 29, 2014 at 3:37 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Ah. DetermineSleepTime(), which is called while signals are unblocked!, >> modifies BackgroundWorkerList. Previously that only iterated the list, >> without modifying it. That's already of quite debatable safety, but >> modifying it without having blocked signals is most definitely >> broken. The modification was introduced by 7f7485a0c... > Ouch. OK, yeah, that's a bug. Yeah. Can we just postpone the signal unblock till after that function? regards, tom lane
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-09-29 16:16:24 -0400, Robert Haas wrote:
> > If you can manually run stuff on that machine, it'd be rather helpful if
> > you could put a PG_SETMASK(&BlockSig);...PG_SETMASK(&UnBlockSig); in the
> > HaveCrashedWorker() loop.
>
> I'd do it the other way around, and adjust ServerLoop to put the
> PG_SETMASK calls right around pg_usleep() and select().

Sounds good.

> But why futz with anole?

Shorter feedback cycles. Anole doesn't seem to run very often, and it takes logging in to see whether it's just slow or hanging...

> Let's just check in the fix. It'll either fix anole or not, but we
> should fix the bug you found either way.

Right. Are you going to do it? I can, but it'll be tomorrow. I'm neck deep in another bug right now.

Greetings,

Andres Freund
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Mon, Sep 29, 2014 at 4:20 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Let's just check in the fix. It'll either fix anole or not, but we >> should fix the bug you found either way. > > Right. Are you going to do it? I can, but it'll be tomorrow. I'm neck > deep in another bug right now. I probably can't do it until Wednesday, but I'll do it then if you can't get to it first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Alvaro Herrera
Date:
Andres Freund wrote: > I'm generally baffled at all the stuff postmaster does in signal > handlers... ProcessConfigFile(), load_hba() et al. It's all done with > signals disabled, but still. As far as I recall, the rationale for why this is acceptable is that the whole of postmaster is run with signals blocked; they are only unblocked during the sleeping select(). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-09-29 18:44:34 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> > I'm generally baffled at all the stuff postmaster does in signal
> > handlers... ProcessConfigFile(), load_hba() et al. It's all done with
> > signals disabled, but still.
>
> As far as I recall, the rationale for why this is acceptable is that the
> whole of postmaster is run with signals blocked; they are only unblocked
> during the sleeping select().

Yea, I wrote that above :). Still seems remarkably fragile and unnecessarily complex. The whole thing would be much simpler and, importantly, easier to understand if everything were done inside the main loop and the handlers just set a latch... But I guess that'd be a bit of a large change to something as central as the postmaster's code.

Greetings,

Andres Freund
--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-09-29 18:44:34 -0300, Alvaro Herrera wrote: >> As far as I recall, the rationale for why this is acceptable is that the >> whole of postmaster is run with signals blocked; they are only unblocked >> during the sleeping select(). > Yea, I wrote that above :). Still seems remarkably fragile and > unnecessarily complex. The whole thing would be much simpler and > importantly easier to understand if everything would be done inside the > mainloop and the handlers just would set a latch... Actually, I rather doubt that it would be either simpler or easier to understand. The reason to think about changing it, IMO, is the fear that sooner or later we're going to file a bug against some platform's libc and they're going to tell us to get lost because POSIX says that such-and-such a library call isn't supported inside a signal handler. > But I guess that'd be a bit of large change to something as central as > postmaster's code.. Yeah. It's a bit scary to consider changing this just to head off a hypothetical portability problem, especially of a class that we've not *actually* tripped across in nigh twenty years. The closest thing that I've seen to that is the valgrind bug we hit awhile back, https://bugzilla.redhat.com/show_bug.cgi?id=1024162 regards, tom lane
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-09-29 14:46:20 -0400, Robert Haas wrote: > This happened again, and I investigated further. Uh. Interestingly anole just succeeded twice: http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE I plan to commit the mask/unmask patch regardless, but it's curious. The first of the two builds could have been you 'unsticking' it by manually mucking around. Did you also do that for the second build? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-09-29 14:46:20 -0400, Robert Haas wrote: >> This happened again, and I investigated further. > > Uh. Interestingly anole just succeeded twice: > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE > > I plan to commit the mask/unmask patch regardless, but it's curious. The > first of the two builds could have been you 'unsticking' it by manually > mucking around. Did you also do that for the second build? No, but I think the failures have always been intermittent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-10-01 10:45:13 -0400, Robert Haas wrote: > On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2014-09-29 14:46:20 -0400, Robert Haas wrote: > >> This happened again, and I investigated further. > > > > Uh. Interestingly anole just succeeded twice: > > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE > > > > I plan to commit the mask/unmask patch regardless, but it's curious. The > > first of the two builds could have been you 'unsticking' it by manually > > mucking around. Did you also do that for the second build? > > No, but I think the failures have always been intermittent. There's no record of any relevantly failing builds on 9.4: http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE and none from master either: http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD Is it setup for master now? Because it has reported back for 9.4 twice, but never for master. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Wed, Oct 1, 2014 at 10:50 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-10-01 10:45:13 -0400, Robert Haas wrote: >> On Wed, Oct 1, 2014 at 7:00 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2014-09-29 14:46:20 -0400, Robert Haas wrote: >> >> This happened again, and I investigated further. >> > >> > Uh. Interestingly anole just succeeded twice: >> > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE >> > >> > I plan to commit the mask/unmask patch regardless, but it's curious. The >> > first of the two builds could have been you 'unsticking' it by manually >> > mucking around. Did you also do that for the second build? >> >> No, but I think the failures have always been intermittent. > > There's no record of any relevantly failing builds on 9.4: > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=REL9_4_STABLE > and none from master either: > http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD > > Is it setup for master now? Because it has reported back for 9.4 twice, > but never for master. As far as I can tell, it's configured to run everything. I just checked, though, and found it wedged again. I'm not sure whether it was the same problem, though; I ended up just killing all of the postgres processes to fix it. We may be just at the beginning of an exciting debugging journey. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Wed, Oct 1, 2014 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote: > As far as I can tell, it's configured to run everything. I just > checked, though, and found it wedged again. I'm not sure whether it > was the same problem, though; I ended up just killing all of the > postgres processes to fix it. We may be just at the beginning of an > exciting debugging journey. After much more time spent on this, and some help from two people whose names are both six letters starting with "Andre", I discovered the following: - anole apparently had not tried to build master in a long time - I think since May - because of a leftover "inst" directory instead HEAD's buildroot. - anole still refused to build either 9.4 or master until I removed a couple of leftover socket directories in /tmp. - after that it hung again trying to run the test_shm_mq regression tests, in pretty much the same way as before. The postmaster state looks like this: 0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1 (gdb) bt #0 0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1 #1 0xc00000000045f950:0 in select+0x90 () from /usr/lib/hpux64/libc.so.1 #2 0x4000000000a4e510:0 in ServerLoop () at postmaster.c:1540 #3 0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178) at postmaster.c:1223 #4 0x40000000008d0a90:0 in main (argc=3, argv=0x87ffffffffffd178) at main.c:227 (gdb) p BackgroundWorkerList $1 = {head = {next = 0x60000000000e8a58}} (gdb) p BackgroundWorkerList.head $2 = {next = 0x60000000000e8a58} (gdb) p ((char *) BackgroundWorkerList.head) - 264 $3 = 0x60000000000e8950 "test_shm_mq" (gdb) p (RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264) $4 = (struct RegisteredBgWorker *) 0x60000000000e8950 (gdb) p ((RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264))[0] $5 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?\000 \000\00511985", '\000' <repeats 13 times>, "\005T\000\000\000\000\000\001?\000 \000\00512038\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\005t\000\000\000\000\000\001?\000 \000\00511917", '\000' <repeats 13 times>, "\005?\000\000\000\000\000\001?\000 \000\0051209", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\000 \000\00511903", '\000' <repeats 13 times>, "\005?\000\000\000\000\000\001?\000 \000\0051199", bgw_main_arg = 1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 12369, rw_child_slot = 35, rw_crashed_at = 0, rw_shmem_slot= 6, rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e8938}} (gdb) p $5.rw_lnode $6 = {next = 0x60000000000e8938} (gdb) p ((RegisteredBgWorker *) (((char *) $5.rw_lnode) - 264))[0] $7 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?x\000 \000\00511931", '\000' <repeats 13 times>, "\004?\000\000\000\000\000\001?}\000 \000\00511939\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\004?\000\000\000\000\000\001?\000 \000\00511957", '\000' <repeats 13 times>, "\004?\000\000\000\000\000\001?\000 \000\0051209", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\000 \000\00511956", '\000' <repeats 13 times>, "\005\024\000\000\000\000\000\001?\000 \000\0051192", bgw_main_arg = 1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot 
=5, rw_terminate = 0 '\000', rw_lnode = { next = 0x60000000000e8818}} (gdb) p ((RegisteredBgWorker *) (((char *) $7.rw_lnode) - 264))[0] $8 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?f\000 \000\00512051", '\000' <repeats 13 times>, "\0044\000\000\000\000\000\001?g\000 \000\00512047\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\004T\000\000\000\000\000\001?j\000 \000\00511979", '\000' <repeats 13 times>, "\004d\000\000\000\000\000\001?k\000 \000\0051197", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?m\000 \000\00511873", '\000' <repeats 13 times>, "\004?\000\000\000\000\000\001?u\000 \000\0051196", bgw_main_arg = 1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot = 4, rw_terminate = 0 '\000', rw_lnode= {next = 0x60000000000e86f8}} (gdb) p ((RegisteredBgWorker *) (((char *) $8.rw_lnode) - 264))[0] $9 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?\\\000 \000\00512085", '\000' <repeats 13 times>, "\003?\000\000\000\000\000\001?]\000 \000\00512076\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\003?\000\000\000\000\000\001?_\000 \000\00511885", '\000' <repeats 13 times>, "\003?\000\000\000\000\000\001?`\000 \000\0051206", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?a\000 \000\00512065", '\000' <repeats 13 times>, "\003?\000\000\000\000\000\001?c\000 \000\0051188", bgw_main_arg = 695156898, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665468967706, rw_shmem_slot = 3, rw_terminate = 0 '\000',rw_lnode = { next = 0x60000000000e85d8}} (gdb) p ((RegisteredBgWorker *) (((char *) $9.rw_lnode) - 264))[0] $10 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001?\f\000 \000\00512045", '\000' <repeats 13 times>, "\003\024\000\000\000\000\000\001?\016\000 \000\00512049\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\0034\000\000\000\000\000\001?=\000 \000\00512057", '\000' <repeats 13 times>, "\003D\000\000\000\000\000\001?A\000 \000\0051208", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?B\000 \000\00512092", '\000' <repeats 13 times>, "\003d\000\000\000\000\000\001?C\000 \000\0051192", bgw_main_arg = 507386864, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665463586145, rw_shmem_slot = 2, rw_terminate = 0 '\000',rw_lnode = { next = 0x60000000000e84b8}} (gdb) p ((RegisteredBgWorker *) (((char *) $10.rw_lnode) - 264))[0] $11 = {rw_worker = { bgw_name = "test_shm_mq\000\000\001??\000 \000\00511937", '\000' <repeats 13 times>, "\002?\000\000\000\000\000\001??\000 \000\00511941\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time = -1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\000\002?\000\000\000\000\000\001?\000\000 \000\00511950", '\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\002\000 \000\0051195", bgw_function_name = "test_shm_mq_main\000\000\000\000\000\001?\a\000 \000\00511960", '\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\b\000 \000\0051198", bgw_main_arg = 1747912200, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, 
rw_child_slot = 0, rw_crashed_at = 465665462962221, rw_shmem_slot = 1, rw_terminate = 0 '\000',rw_lnode = { next = 0x60000000000e8398}} (gdb) p ((RegisteredBgWorker *) (((char *) $11.rw_lnode) - 264))[0] $12 = {rw_worker = { bgw_name = "test_shm_mq", '\000' <repeats 11 times>, " \000\000\000\000\000\000\000 ", '\000' <repeats 23 times>, "\021?\000\000\000\000\000\001??", bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState, bgw_restart_time =-1, bgw_main = 0, bgw_library_name = "test_shm_mq\000\000\001??\000(\000\t12115_fsm", '\000' <repeats 17 times>, "\022(\000\000\000\000\000\001??\0000\000\020pg_i", bgw_function_name = "test_shm_mq_main", '\000' <repeats 14times>, "\0228\000\000\000\000\000\001?\003\000 \000\00552236", '\000' <repeats 13 times>, "\022H", bgw_main_arg = 502019512, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462860980, rw_shmem_slot = 0, rw_terminate= 0 '\000', rw_lnode = {next = 0x0}} (gdb) fr 3 #3 0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178) at postmaster.c:1223 1223 status = ServerLoop(); (gdb) p StartWorkerNeeded $13 = 1 '\001' So apparently commit a39e78b710eb588e102aedd2828611d7bc74714b, a good idea though it certainly was, wasn't the root cause of this issue - because the symptoms now seem identical to what I saw debugging earlier hangs, but I verified that those changes are present here. Further debugging reveals that sigusr1_handler() gets called repeatedly, to start autovacuum workers, and it keeps waking up and starting them. But that doesn't cause the background workers to get started either, because although sigusr1_handler() contains a call to maybe_start_bgworker, it only does that if start_bgworker = true, which only happens if CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true, which on these calls it isn't. And I think this might also be the missing ingredient answering Andres's question from before: why doesn't the 60-second select()-timeout cause the background worker to eventually start even if the SELECT doesn't get interrupted? There seems to be a SIGUSR1 arriving about every 3 seconds, and I bet that's resetting the timer every time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
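The timer-reset effect described above is a classic hazard with relative timeouts. A hedged, standalone sketch (not postmaster code) of the alternative: if the remaining time is recomputed from an absolute deadline instead of being restarted at its full value, a signal arriving every few seconds can no longer postpone the timeout forever.

#include <errno.h>
#include <stdio.h>
#include <sys/select.h>
#include <time.h>

/*
 * Wait until an absolute deadline.  Frequent signals may interrupt the
 * select() as often as they like; because the remaining time is recomputed
 * from the deadline rather than restarted, the timeout still fires on
 * schedule.
 */
static int
wait_until(time_t deadline)
{
	for (;;)
	{
		time_t		now = time(NULL);
		struct timeval tv;

		if (now >= deadline)
			return 0;				/* timeout reached despite interruptions */

		tv.tv_sec = deadline - now;	/* remaining time, not the full interval */
		tv.tv_usec = 0;

		if (select(0, NULL, NULL, NULL, &tv) < 0 && errno != EINTR)
			return -1;
		/* on EINTR (a signal arrived) just loop and recompute the remainder */
	}
}

int
main(void)
{
	wait_until(time(NULL) + 5);
	printf("deadline reached\n");
	return 0;
}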
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Further debugging reveals that sigusr1_handler() gets called
> repeatedly, to start autovacuum workers, and it keeps waking up and
> starting them. But that doesn't cause the background workers to get
> started either, because although sigusr1_handler() contains a call to
> maybe_start_bgworker, it only does that if start_bgworker = true,
> which only happens if
> CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> which on these calls it isn't.
>
> And I think this might also be the missing ingredient answering
> Andres's question from before: why doesn't the 60-second
> select()-timeout cause the background worker to eventually start even
> if the SELECT doesn't get interrupted? There seems to be a SIGUSR1
> arriving about every 3 seconds, and I bet that's resetting the timer
> every time.

For now I propose to address this by committing the attached patch, which gets rid of the separate start_bgworker flag inside sigusr1_handler() and instead uses the same StartWorkerNeeded || HaveCrashedWorker test that we use inside ServerLoop() to decide whether to call maybe_start_bgworker(). Either more signals will arrive (in which case the signal handler will launch an additional background worker every time a signal arrives) or they won't (in which case the 60-second timeout will eventually expire, and ServerLoop() will kick into high gear and satisfy all outstanding requests). This isn't really right, because there might still be a quite noticeable delay waiting for workers to get launched, but at least the delay would be bounded to at most 60 seconds rather than, as at present, potentially infinite.

A totally correct fix will require a bit more thought. A search of the git history reveals that the problem of a signal restarting the timeout is not new: Tom fixed a similar problem back in 2007 by making the autovacuum launcher sleep for at most a second at a time. Such a fix isn't ideal here, because we really don't want an up-to-1-second delay launching a newly-registered background worker if there's a way to avoid that -- it's probably OK for launching daemons, but it's not so hot for parallel query. However, we could:

(1) Use the self-pipe trick. We could not directly use a latch, at least not without a new API, because we might be waiting on more than one socket.

(2) Have the postmaster not set SA_RESTART for the sigusr1 handler. I don't know how platform-independent this approach would be.

(3) Have sigusr1_handler repeatedly call maybe_start_bgworker() until StartWorkerNeeded becomes false, instead of just calling it once. ServerLoop() is carefully coded to call maybe_start_bgworker() just once per iteration, presumably to make sure the server stays responsive even if the bgworker-starting machinery is quite busy; looping inside the signal handler would give up that nice property unless we had some way to break out of the loop if there's activity on the socket.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
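For option (1), here is a minimal standalone illustration of the self-pipe trick (not the patch itself, and independent of PostgreSQL's latch machinery): the handler writes a byte into a non-blocking pipe whose read end is part of the select() set, so a signal that arrives just before or during the wait always leaves the loop with something to read.

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static int	selfpipe[2];

static void
handle_usr1(int signo)
{
	int			save_errno = errno;

	(void) write(selfpipe[1], "x", 1);	/* async-signal-safe wakeup */
	errno = save_errno;
}

int
main(void)
{
	struct sigaction sa;
	fd_set		rfds;
	char		buf[16];

	if (pipe(selfpipe) != 0)
		return 1;
	fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
	fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = handle_usr1;
	sa.sa_flags = SA_RESTART;		/* restarting is fine: the pipe still wakes us */
	sigaction(SIGUSR1, &sa, NULL);

	kill(getpid(), SIGUSR1);		/* pretend a worker registration arrived */

	FD_ZERO(&rfds);
	FD_SET(selfpipe[0], &rfds);		/* in the postmaster this would sit alongside
									 * the listen sockets in the same fd set */
	select(selfpipe[0] + 1, &rfds, NULL, NULL, NULL);

	while (read(selfpipe[0], buf, sizeof(buf)) > 0)
		;							/* drain the pipe, then act on pending requests */
	printf("woke up via self-pipe\n");
	return 0;
}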
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Andres Freund
Date:
On 2014-10-03 14:38:10 -0400, Robert Haas wrote:
> On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Further debugging reveals that sigusr1_handler() gets called
> > repeatedly, to start autovacuum workers, and it keeps waking up and
> > starting them. But that doesn't cause the background workers to get
> > started either, because although sigusr1_handler() contains a call to
> > maybe_start_bgworker, it only does that if start_bgworker = true,
> > which only happens if
> > CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> > which on these calls it isn't.
> > And I think this might also be the missing ingredient answering
> > Andres's question from before: why doesn't the 60-second
> > select()-timeout cause the background worker to eventually start even
> > if the SELECT doesn't get interrupted? There seems to be a SIGUSR1
> > arriving about every 3 seconds, and I bet that's resetting the timer
> > every time.

Ick. There are a couple of reports about HPUX behaving that way for select(), so that seems like a reasonable theory. And a good explanation of why only anole was seeing this.

> For now I propose to address this by committing the attached patch,
> which gets rid of the separate start_bgworker flag inside
> sigusr1_handler() and instead uses the same StartWorkerNeeded ||
> HaveCrashedWorker test that we use inside ServerLoop() to decide
> whether to call maybe_start_bgworker().

That seems sane.

> Either more signals will
> arrive (in which case the signal handler will launch an additional
> background worker every time a signal arrives) or they won't (in which
> case the 60-second timeout will eventually expire, and ServerLoop()
> will kick into high gear and satisfy all outstanding requests). This
> isn't really right, because there might still be a quite noticeable
> delay waiting for workers to get launched, but at least the delay
> would be bounded to at most 60 seconds rather than, as at present,
> potentially infinite.

Right, it will somewhat fix the current issue. But aren't the consequences of a potentially never/very seldom run ServerLoop() more severe? There are a couple more things done in that loop :(

> A totally correct fix will require a bit more thought. A search of
> the git history reveals that the problem of a signal restarting the
> timeout is not new: Tom fixed a similar problem back in 2007 by making
> the autovacuum launcher sleep for at most a second at a time. Such a
> fix isn't ideal here, because we really don't want an up-to-1-second
> delay launching a newly-registered background worker if there's a way
> to avoid that -- it's probably OK for launching daemons, but it's not
> so hot for parallel query. However, we could:
>
> (1) Use the self-pipe trick. We could not directly use a latch, at
> least not without a new API, because we might be waiting on more than
> one socket.

I think this is the way to go. And I think we should make the latch API support that case, instead of reinventing it. That'll also have the advantage of automatically using poll() (which is more efficient), rather than select(), when supported by the platform.

> (2) Have the postmaster not set SA_RESTART for the sigusr1 handler. I
> don't know how platform-independent this approach would be.

This makes me wary. Postmaster doesn't do too much stuff, but it'd still be annoying to have to make sure everything always loops around EINTR.

Greetings,

Andres Freund
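A hedged sketch of what such an extended wait primitive could look like with poll(): one slot for the latch's self-pipe read end plus one slot per monitored socket. The function name and shape here are hypothetical, not the actual latch API; STDIN stands in for a socket.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical wait primitive: one pollfd slot for the latch's self-pipe
 * read end plus one per monitored socket.  Returns poll()'s result.
 */
static int
wait_latch_or_sockets(int latch_fd, const int *socks, int nsocks, int timeout_ms)
{
	struct pollfd pfds[1 + 16];
	int			i;

	pfds[0].fd = latch_fd;			/* self-pipe read end: "latch is set" */
	pfds[0].events = POLLIN;
	for (i = 0; i < nsocks && i < 16; i++)
	{
		pfds[1 + i].fd = socks[i];	/* any number of sockets, not just one */
		pfds[1 + i].events = POLLIN;
	}
	return poll(pfds, 1 + i, timeout_ms);
}

int
main(void)
{
	int			sock = STDIN_FILENO;	/* stand-in for a listen socket */
	int			rc;

	/* a negative fd is simply ignored by poll(), so -1 works as "no latch" here */
	rc = wait_latch_or_sockets(-1, &sock, 1, 1000);
	printf("poll returned %d\n", rc);
	return 0;
}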
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Christoph Berg
Date:
Hi,

I'm still seeing trouble with test_shm_mq on mipsel (9.4 rc1):

https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mipsel&ver=9.4~rc1-1&stamp=1416547779

mips had the problem as well in the past (9.4 beta3):

https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-3&stamp=1413607370
https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-1&stamp=1412893135

The mips beta3 failures eventually went away when the build was done on a different machine. This was the first time the mipsel build was done on this build machine, so it seems the problem might well be caused by some subarchitecture difference.

mipsel:
bad: Broadcom BCM91250A aka SWARM (mayer.debian.org)
good: Lemote 3A ITX-A1101 (Quad Core Loongson 3A) (mipsel-manda-01.debian.org)

mips:
bad: Broadcom BCM91250A aka SWARM (ball.debian.org)
good: EdgeRouter Pro (mips-aql-02.debian.org)
good: 16 x Cavium Octeon V0.3 (lucatelli.debian.org)

https://buildd.debian.org/status/logs.php?pkg=postgresql-9.4&arch=mipsel
https://buildd.debian.org/status/logs.php?pkg=postgresql-9.4&arch=mips

(Not all failures listed there are due to shm_mq, just the newer ones.)

At the moment all I have is the above build logs with the fact that the build simply hangs there, i.e. I haven't seen the problem in a context yet where I could have pulled a backtrace or the like.

Anyone got an idea?

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Robert Haas
Date:
On Mon, Nov 24, 2014 at 9:36 AM, Christoph Berg <cb@df7cb.de> wrote:
> I'm still seeing trouble with test_shm_mq on mipsel (9.4 rc1):

Boy, that test has certainly caught its share of bugs, and not in the places I would have expected. The last round of wrestling with this had to do with working around HP-UX behavior that differs from Linux. So it seems like this is likely to be an altogether different failure than what we saw on anole.

> https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mipsel&ver=9.4~rc1-1&stamp=1416547779
>
> mips had the problem as well in the past (9.4 beta3):
>
> https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-3&stamp=1413607370
> https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-1&stamp=1412893135

For how long was it hung before you killed it?

> The mips beta3 failures eventually went away when the build was done
> on a different machine. This was the first time the mipsel build was
> done on this build machine, so it seems the problem might well be
> caused by some subarchitecture difference.

Does it fail every time when run on a machine where it fails sometimes? It might not be related to the subarchitecture difference in any particularly interesting way; it could just be a race condition that is triggered, or not, depending on the precise timing of things, which might vary based on subarchitecture, compiler, running kernel version, etc.

> Anyone got an idea?

Not off-hand. :-(
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
From
Christoph Berg
Date:
Re: Robert Haas 2014-11-24 <CA+TgmoacnppMdgg4n14U2BjujNDNMOU8xxHhPMvO+0u92ckH+w@mail.gmail.com>
> > https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mipsel&ver=9.4~rc1-1&stamp=1416547779
> >
> > mips had the problem as well in the past (9.4 beta3):
> >
> > https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-3&stamp=1413607370
> > https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4&arch=mips&ver=9.4~beta3-1&stamp=1412893135
>
> For how long was it hung before you killed it?

That was an automated build, killed "after 300 minutes of inactivity". (I guess that means it didn't output anything on the terminal for that long, but I've never looked closer into sbuild.)

> > The mips beta3 failures eventually went away when the build was done
> > on a different machine. This was the first time the mipsel build was
> > done on this build machine, so it seems the problem might well be
> > caused by some subarchitecture difference.
>
> Does it fail every time when run on a machine where it fails sometimes?

So far there's a consistent host -> fail-or-not mapping, but most mips/mipsel build hosts have seen only one attempt so far that actually got far enough to run the shm_mq test.

> It might not be related to the subarchitecture difference in any
> particularly interesting way; it could just be a race condition that
> is triggered, or not, depending on the precise timing of things, which
> might vary based on subarchitecture, compiler, running kernel version,
> etc.

Compiler and kernel should mostly be the same on all hosts (gcc from Debian unstable inside the build chroots and the kernel from stable on the host system). The common bit on the failing hosts is that both mips and mipsel are registered as "Broadcom BCM91250A aka SWARM", which is hinting at a subarch problem. But of course that could just be a coincidence.

Atm I don't have access to the boxes where it was failing (the builds succeed on the mips(el) porter hosts available to Debian developers). I'll see if I can arrange access there and run a test.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/
Re: To Robert Haas 2014-11-24 <20141124200824.GA22662@msg.df7cb.de> > > Does it fail every time when run on a machine where it fails sometimes? > > So far there's a consistent host -> fail-or-not mapping, but most > mips/mipsel build hosts have seen only one attempt so far which > actually came so far to actually run the shm_mq test. I got the build rescheduled on the same machine and it's hanging again. > Atm I don't have access to the boxes where it was failing (the builds > succeed on the mips(el) porter hosts available to Debian developers). > I'll see if I can arrange access there and run a test. Julien Cristau was so kind to poke into the hanging processes. The build has been stuck now for about 4h, while that postgres backend has only consumed 4s of CPU time (according to plain "ps"). The currently executing query is: SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200,3); (Waiting f, active, backend_start 6s older than xact_start, xact_start same as query_start, state_change 19µs newer, all 4h old) Backtrace: Reading symbols from /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/contrib/test_shm_mq/tmp_check/install/usr/lib/postgresql/9.4/lib/test_shm_mq.so...done. Loaded symbols for /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/contrib/test_shm_mq/tmp_check/install/usr/lib/postgresql/9.4/lib/test_shm_mq.so 0x76ea1d98 in poll () from /lib/mipsel-linux-gnu/libc.so.6 (gdb) where #0 0x76ea1d98 in poll () from /lib/mipsel-linux-gnu/libc.so.6 #1 0x00683044 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fe2d084) at /usr/include/mipsel-linux-gnu/bits/poll2.h:41 #2 WaitLatchOrSocket (latch=0x766cae7c, wakeEvents=1, sock=-1, timeout=0) at pg_latch.c:325 #3 0x6dc91968 in test_shm_mq_pipelined (fcinfo=<optimized out>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../contrib/test_shm_mq/test.c:231 #4 0x005d448c in ExecMakeFunctionResultNoSets (fcache=0xda4658, econtext=0xda4590, isNull=0xda4c80 "hL\332", isDone=<optimizedout>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execQual.c:2023 #5 0x005db0fc in ExecTargetList (isDone=0x7fe2d210, itemIsDone=0xda4d18, isnull=0xda4c80 "hL\332", values=0xda4c70, econtext=0xda4590,targetlist=0xda4d00) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execQual.c:5304 #6 ExecProject (projInfo=<optimized out>, isDone=0x7fe2d210) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execQual.c:5519 #7 0x005f08b8 in ExecResult (node=0xda4508) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/nodeResult.c:155 #8 0x005d308c in ExecProcNode (node=0xda4508) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execProcnode.c:373 #9 0x005cfb5c in ExecutePlan (dest=0xd9e0a0, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT,planstate=0xda4508, estate=0xda1640) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execMain.c:1475 #10 standard_ExecutorRun (queryDesc=0xd18240, direction=<optimized out>, count=0) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/executor/execMain.c:308 #11 0x00707bc0 in PortalRunSelect (portal=portal@entry=0xd85518, forward=forward@entry=1 '\001', count=0, count@entry=2147483647,dest=dest@entry=0xd9e0a0) at 
/build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/pquery.c:946 #12 0x0070959c in PortalRun (portal=0xd85518, count=2147483647, isTopLevel=<optimized out>, dest=0xd9e0a0, altdest=0xd9e0a0,completionTag=0x7fe2d6bc "") at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/pquery.c:790 #13 0x007067b0 in exec_simple_query ( query_string=0xd594d0 "SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int),'') from generate_series(1,270000)), 200, 3);") at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/postgres.c:1045 #14 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0xcfa180 "contrib_regression", username=<optimizedout>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/tcop/postgres.c:4016 #15 0x00450888 in BackendRun (port=0xd16288) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:4123 #16 BackendStartup (port=0xd16288) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:3797 #17 ServerLoop () at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:1576 #18 0x006983b8 in PostmasterMain (argc=8, argv=<optimized out>) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/postmaster/postmaster.c:1223 #19 0x0045256c in main (argc=8, argv=0xcf9890) at /build/postgresql-9.4-cq5GAt/postgresql-9.4-9.4~rc1/build/../src/backend/main/main.c:227 Kernel stack: sudo cat /proc/29001/stack [<ffffffff805cb7c0>] __schedule+0x300/0x9c0 [<ffffffff805cd598>] schedule_hrtimeout_range_clock+0xd8/0x100 [<ffffffff80238ca4>] poll_schedule_timeout+0x34/0x60 [<ffffffff8023a168>] do_sys_poll+0x320/0x470 [<ffffffff8023a408>] SyS_poll+0xe0/0x168 [<ffffffff801142b4>] handle_sys+0x114/0x134 Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Tue, Nov 25, 2014 at 10:42 AM, Christoph Berg <cb@df7cb.de> wrote:
> Re: To Robert Haas 2014-11-24 <20141124200824.GA22662@msg.df7cb.de>
>> > Does it fail every time when run on a machine where it fails sometimes?
>>
>> So far there's a consistent host -> fail-or-not mapping, but most
>> mips/mipsel build hosts have seen only one attempt so far that
>> actually got far enough to run the shm_mq test.
>
> I got the build rescheduled on the same machine and it's hanging
> again.
>
>> Atm I don't have access to the boxes where it was failing (the builds
>> succeed on the mips(el) porter hosts available to Debian developers).
>> I'll see if I can arrange access there and run a test.
>
> Julien Cristau was so kind to poke into the hanging processes. The
> build has been stuck now for about 4h, while that postgres backend has
> only consumed 4s of CPU time (according to plain "ps"). The currently
> executing query is:
>
> SELECT test_shm_mq_pipelined(16384, (select string_agg(chr(32+(random()*95)::int), '') from generate_series(1,270000)), 200, 3);
>
> (Waiting f, active, backend_start 6s older than xact_start, xact_start
> same as query_start, state_change 19µs newer, all 4h old)

I can't tell from this exactly what's going wrong. Questions:

1. Are there any background worker processes running at the same time? If so, how many? We'd expect to see 3.

2. Can we get a printout of the following variables in stack frame 3 (test_shm_mq_pipelined)? send_count, loop_count, *outqh, *inqh, inqh->mqh_queue[0], outqh->mqh_queue[0]

3. What does a backtrace of each background worker process look like? If they are stuck inside copy_messages(), can you print *outqh, *inqh, inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?

Sorry for the hassle; I just don't have a better idea how to debug this.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Robert Haas 2014-12-02 <CA+Tgmoa9HE=1GObU7MKuGLDBjBNSNRW0bDE4P3JA=P=9mqGgqw@mail.gmail.com>
> I can't tell from this exactly what's going wrong. Questions:
>
> 1. Are there any background worker processes running at the same time?
> If so, how many? We'd expect to see 3.
> 2. Can we get a printout of the following variables in stack frame 3
> (test_shm_mq_pipelined)? send_count, loop_count, *outqh, *inqh,
> inqh->mqh_queue[0], outqh->mqh_queue[0]
> 3. What does a backtrace of each background worker process look like?
> If they are stuck inside copy_messages(), can you print *outqh, *inqh,
> inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?
>
> Sorry for the hassle; I just don't have a better idea how to debug this.

No problem, I'm aware this isn't an easy target. I didn't have access to the build env myself, and I'm not sure if I can find someone to retry the test there. I'll let you know when I have more data, but that will take a while (if at all :-/).

Christoph
--
cb@df7cb.de | http://www.df7cb.de/