Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Date
Msg-id CA+TgmoaGF1_BQNUW-nyytS1fb1w-5cscor-RBuGogH_hguNmXQ@mail.gmail.com
Whole thread Raw
In response to Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
List pgsql-hackers
On Wed, Oct 1, 2014 at 11:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> As far as I can tell, it's configured to run everything.  I just
> checked, though, and found it wedged again.  I'm not sure whether it
> was the same problem, though; I ended up just killing all of the
> postgres processes to fix it.  We may be just at the beginning of an
> exciting debugging journey.

After much more time spent on this, and some help from two people
whose names are both six letters starting with "Andre", I discovered
the following:

- anole apparently had not tried to build master in a long time - I
think since May - because of a leftover "inst" directory instead
HEAD's buildroot.
- anole still refused to build either 9.4 or master until I removed a
couple of leftover socket directories in /tmp.
- after that it hung again trying to run the test_shm_mq regression
tests, in pretty much the same way as before.

The postmaster state looks like this:

0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) bt
#0  0xc00000000044a350:0 in _select_sys+0x30 () from /usr/lib/hpux64/libc.so.1
#1  0xc00000000045f950:0 in select+0x90 () from /usr/lib/hpux64/libc.so.1
#2  0x4000000000a4e510:0 in ServerLoop () at postmaster.c:1540
#3  0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178)   at postmaster.c:1223
#4  0x40000000008d0a90:0 in main (argc=3, argv=0x87ffffffffffd178)   at main.c:227
(gdb) p BackgroundWorkerList
$1 = {head = {next = 0x60000000000e8a58}}
(gdb) p BackgroundWorkerList.head
$2 = {next = 0x60000000000e8a58}
(gdb) p ((char *) BackgroundWorkerList.head) - 264
$3 = 0x60000000000e8950 "test_shm_mq"
(gdb) p (RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264)
$4 = (struct RegisteredBgWorker *) 0x60000000000e8950
(gdb) p ((RegisteredBgWorker *) (((char *) BackgroundWorkerList.head) - 264))[0]
$5 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?\000 \000\00511985", '\000'
<repeats 13 times>, "\005T\000\000\000\000\000\001?\000
\000\00512038\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\005t\000\000\000\000\000\001?\000
\000\00511917", '\000' <repeats 13 times>,
"\005?\000\000\000\000\000\001?\000 \000\0051209",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\000
\000\00511903", '\000' <repeats 13 times>,
"\005?\000\000\000\000\000\001?\000 \000\0051199", bgw_main_arg =
1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 12369, rw_child_slot = 35, rw_crashed_at = 0,
rw_shmem_slot= 6, rw_terminate = 0 '\000', rw_lnode = {next = 0x60000000000e8938}}
 
(gdb) p $5.rw_lnode
$6 = {next = 0x60000000000e8938}
(gdb) p ((RegisteredBgWorker *) (((char *) $5.rw_lnode) - 264))[0]
$7 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?x\000 \000\00511931", '\000'
<repeats 13 times>, "\004?\000\000\000\000\000\001?}\000
\000\00511939\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\004?\000\000\000\000\000\001?\000
\000\00511957", '\000' <repeats 13 times>,
"\004?\000\000\000\000\000\001?\000 \000\0051209",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\000
\000\00511956", '\000' <repeats 13 times>,
"\005\024\000\000\000\000\000\001?\000 \000\0051192", bgw_main_arg =
1069289198, bgw_notify_pid = 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot
=5, rw_terminate = 0 '\000', rw_lnode = {   next = 0x60000000000e8818}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $7.rw_lnode) - 264))[0]
$8 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?f\000 \000\00512051", '\000'
<repeats 13 times>, "\0044\000\000\000\000\000\001?g\000
\000\00512047\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\004T\000\000\000\000\000\001?j\000
\000\00511979", '\000' <repeats 13 times>,
"\004d\000\000\000\000\000\001?k\000 \000\0051197",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?m\000 \000\00511873", '\000'
<repeats 13 times>, "\004?\000\000\000\000\000\001?u\000
\000\0051196", bgw_main_arg = 1069289198, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 0, rw_shmem_slot = 4, rw_terminate = 0 '\000',
rw_lnode= {next = 0x60000000000e86f8}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $8.rw_lnode) - 264))[0]
$9 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?\\\000 \000\00512085", '\000'
<repeats 13 times>, "\003?\000\000\000\000\000\001?]\000
\000\00512076\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\003?\000\000\000\000\000\001?_\000
\000\00511885", '\000' <repeats 13 times>,
"\003?\000\000\000\000\000\001?`\000 \000\0051206",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?a\000 \000\00512065", '\000'
<repeats 13 times>, "\003?\000\000\000\000\000\001?c\000
\000\0051188", bgw_main_arg = 695156898, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665468967706, rw_shmem_slot = 3, rw_terminate = 0
'\000',rw_lnode = {   next = 0x60000000000e85d8}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $9.rw_lnode) - 264))[0]
$10 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001?\f\000 \000\00512045", '\000'
<repeats 13 times>, "\003\024\000\000\000\000\000\001?\016\000
\000\00512049\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\0034\000\000\000\000\000\001?=\000
\000\00512057", '\000' <repeats 13 times>,
"\003D\000\000\000\000\000\001?A\000 \000\0051208",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?B\000 \000\00512092", '\000'
<repeats 13 times>, "\003d\000\000\000\000\000\001?C\000
\000\0051192", bgw_main_arg = 507386864, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665463586145, rw_shmem_slot = 2, rw_terminate = 0
'\000',rw_lnode = {   next = 0x60000000000e84b8}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $10.rw_lnode) - 264))[0]
$11 = {rw_worker = {   bgw_name = "test_shm_mq\000\000\001??\000 \000\00511937", '\000'
<repeats 13 times>, "\002?\000\000\000\000\000\001??\000
\000\00511941\000\000\000\000\000\000", bgw_flags = 1, bgw_start_time
= BgWorkerStart_ConsistentState,   bgw_restart_time = -1, bgw_main = 0,   bgw_library_name =
"test_shm_mq\000\000\000\002?\000\000\000\000\000\001?\000\000
\000\00511950", '\000' <repeats 13 times>,
"\002?\000\000\000\000\000\001?\002\000 \000\0051195",   bgw_function_name =
"test_shm_mq_main\000\000\000\000\000\001?\a\000 \000\00511960",
'\000' <repeats 13 times>, "\002?\000\000\000\000\000\001?\b\000
\000\0051198", bgw_main_arg = 1747912200, bgw_notify_pid = 12361},
rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462962221, rw_shmem_slot = 1, rw_terminate = 0
'\000',rw_lnode = {   next = 0x60000000000e8398}}
 
(gdb) p ((RegisteredBgWorker *) (((char *) $11.rw_lnode) - 264))[0]
$12 = {rw_worker = {   bgw_name = "test_shm_mq", '\000' <repeats 11 times>, "
\000\000\000\000\000\000\000 ", '\000' <repeats 23 times>,
"\021?\000\000\000\000\000\001??",   bgw_flags = 1, bgw_start_time = BgWorkerStart_ConsistentState,   bgw_restart_time
=-1, bgw_main = 0,   bgw_library_name =
 
"test_shm_mq\000\000\001??\000(\000\t12115_fsm", '\000' <repeats 17
times>, "\022(\000\000\000\000\000\001??\0000\000\020pg_i",   bgw_function_name = "test_shm_mq_main", '\000' <repeats
14times>,
 
"\0228\000\000\000\000\000\001?\003\000 \000\00552236", '\000'
<repeats 13 times>, "\022H", bgw_main_arg = 502019512, bgw_notify_pid
= 12361}, rw_backend = 0x0, rw_pid = 0, rw_child_slot = 0, rw_crashed_at = 465665462860980, rw_shmem_slot = 0,
rw_terminate= 0 '\000', rw_lnode = {next = 0x0}}
 
(gdb) fr 3
#3  0x4000000000a48b20:0 in PostmasterMain (argc=3, argv=0x87ffffffffffd178)   at postmaster.c:1223
1223            status = ServerLoop();
(gdb) p StartWorkerNeeded
$13 = 1 '\001'

So apparently commit a39e78b710eb588e102aedd2828611d7bc74714b, a good
idea though it certainly was, wasn't the root cause of this issue -
because the symptoms now seem identical to what I saw debugging
earlier hangs, but I verified that those changes are present here.

Further debugging reveals that sigusr1_handler() gets called
repeatedly, to start autovacuum workers, and it keeps waking up and
starting them.  But that doesn't cause the background workers to get
started either, because although sigusr1_handler() contains a call to
maybe_start_bgworker, it only does that if start_bgworker = true,
which only happens if
CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
which on these calls it isn't.
And I think this might also be the missing ingredient answering
Andres's question from before: why doesn't the 60-second
select()-timeout cause the background worker to eventually start even
if the SELECT doesn't get interrupted?  There seems to be a SIGUSR1
arriving about every 3 seconds, and I bet that's resetting the timer
every time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: replicating DROP commands across servers
Next
From: Heikki Linnakangas
Date:
Subject: Re: replicating DROP commands across servers