Thread: Stats Collector Won't Start
Can someone please provide a bit of information where the following error is coming from? This is PG 8.1.3 on AIX 5.3 LOG: could not bind socket for statistics collector: Permission denied LOG: disabling statistics collector for lack of working socket What exactly does the PG user not have appropriate permissions to? -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Brad Nicholson <bnichols@ca.afilias.info> writes: > Can someone please provide a bit of information where the following > error is coming from? This is PG 8.1.3 on AIX 5.3 > LOG: could not bind socket for statistics collector: Permission denied That's bizarre. What error conditions does your man page for bind(2) document as yielding EACCES? The only one mentioned on my systems is "protected address", but we aren't requesting a reserved port number ... regards, tom lane
On Wed, 2006-10-18 at 13:00 -0400, Tom Lane wrote: > Brad Nicholson <bnichols@ca.afilias.info> writes: > > Can someone please provide a bit of information where the following > > error is coming from? This is PG 8.1.3 on AIX 5.3 > > > LOG: could not bind socket for statistics collector: Permission denied > > That's bizarre. What error conditions does your man page for bind(2) > document as yielding EACCES? The only one mentioned on my systems is > "protected address", but we aren't requesting a reserved port number ... > > regards, tom lane "The requested address is protected, and the current user does not have permission to access it." -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
I just had this problem last night.... my local firewall was blocking connections on the loopback interface. On Wed, 18 Oct 2006, Brad Nicholson wrote: > Can someone please provide a bit of information where the following > error is coming from? This is PG 8.1.3 on AIX 5.3 > > LOG: could not bind socket for statistics collector: Permission denied > LOG: disabling statistics collector for lack of working socket > > What exactly does the PG user not have appropriate permissions to? > > -- > Brad Nicholson 416-673-4106 > Database Administrator, Afilias Canada Corp. > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster >
Brad Nicholson <bnichols@ca.afilias.info> writes: > On Wed, 2006-10-18 at 13:00 -0400, Tom Lane wrote: >> That's bizarre. What error conditions does your man page for bind(2) >> document as yielding EACCES? The only one mentioned on my systems is >> "protected address", but we aren't requesting a reserved port number ... > "The requested address is protected, and the current user does not have > permission to access it." That's what the Linux man page says, too. Could getaddrinfo somehow be returning bogus data that includes a reserved port number? Would you try strace'ing postmaster start to see what gets passed to the socket() and bind() calls just before this message comes out? regards, tom lane
On Wed, 2006-10-18 at 14:31 -0400, Tom Lane wrote: > Brad Nicholson <bnichols@ca.afilias.info> writes: > > On Wed, 2006-10-18 at 13:00 -0400, Tom Lane wrote: > >> That's bizarre. What error conditions does your man page for bind(2) > >> document as yielding EACCES? The only one mentioned on my systems is > >> "protected address", but we aren't requesting a reserved port number ... > > > "The requested address is protected, and the current user does not have > > permission to access it." > > That's what the Linux man page says, too. Could getaddrinfo somehow be > returning bogus data that includes a reserved port number? > > Would you try strace'ing postmaster start to see what gets passed to the > socket() and bind() calls just before this message comes out? Here is the relative output from truss. open("/etc/hosts", O_RDONLY) = 7 kioctl(7, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY kfcntl(7, F_SETFD, 0x0000000000000001) = 0 kioctl(7, 22528, 0x0000000000000000, 0x0000000000000000) Err#25 ENOTTY kread(7, " # @ ( # ) 4 7\t 1 . 1".., 4096) = 2770 kread(7, " # @ ( # ) 4 7\t 1 . 1".., 4096) = 0 close(7) = 0 socket(2, 2, 0) = 7 bind(7, 0x00000001100E3BF0, 16) Err#13 EACCES kwrite(2, " 2 0 0 6 - 1 0 -", 8) = 8 kwrite(2, " 1 8 1 9 : 0 1", 8) = 8 kwrite(2, " : 0 1 . 7 2 5 ", 8) = 8 kwrite(2, " U T C [ 2 8 2", 8) = 8 kwrite(2, " 9 3 2 ] ", 8) = 8 kwrite(2, " L O G : c o", 8) = 8 kwrite(2, " u l d n o t ", 8) = 8 kwrite(2, " b i n d s o c", 8) = 8 kwrite(2, " k e t f o r ", 8) = 8 kwrite(2, " s t a t i s t i", 8) = 8 kwrite(2, " c s c o l l e", 8) = 8 kwrite(2, " c t o r : P e", 8) = 8 kwrite(2, " r m i s s i o n", 8) = 8 kwrite(2, " d e n i e d\n", 8) = 8 close(7) = 0 kwrite(2, " 2 0 0 6 - 1 0 -", 8) = 8 kwrite(2, " 1 8 1 9 : 0 1", 8) = 8 kwrite(2, " : 0 1 . 7 2 8 ", 8) = 8 kwrite(2, " U T C [ 2 8 2", 8) = 8 kwrite(2, " 9 3 2 ] ", 8) = 8 kwrite(2, " L O G : d i", 8) = 8 kwrite(2, " s a b l i n g ", 8) = 8 kwrite(2, " s t a t i s t i", 8) = 8 kwrite(2, " c s c o l l e", 8) = 8 kwrite(2, " c t o r f o r", 8) = 8 kwrite(2, " l a c k o f", 8) = 8 kwrite(2, " w o r k i n g", 8) = 8 kwrite(2, " s o c k e t\n", 8) = 8 -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Brad Nicholson <bnichols@ca.afilias.info> writes: > On Wed, 2006-10-18 at 14:31 -0400, Tom Lane wrote: >> Would you try strace'ing postmaster start to see what gets passed to the >> socket() and bind() calls just before this message comes out? > Here is the relative output from truss. > socket(2, 2, 0) = 7 > bind(7, 0x00000001100E3BF0, 16) Err#13 EACCES Well, that's pretty odd --- I'd have expected to see something involving 127.0.0.1 (ie, 7F000001 in one byte order or another). Does your /etc/hosts file map "localhost" to something other than 127.0.0.1? Also, do regular connections to this postmaster work across TCP/IP? If getaddrinfo() is broken I'd expect there to be problems binding to the postmaster's listen socket too ... regards, tom lane
On Wed, 2006-10-18 at 15:59 -0400, Tom Lane wrote: > Brad Nicholson <bnichols@ca.afilias.info> writes: > > On Wed, 2006-10-18 at 14:31 -0400, Tom Lane wrote: > >> Would you try strace'ing postmaster start to see what gets passed to the > >> socket() and bind() calls just before this message comes out? > > > Here is the relative output from truss. > > > socket(2, 2, 0) = 7 > > bind(7, 0x00000001100E3BF0, 16) Err#13 EACCES > > Well, that's pretty odd --- I'd have expected to see something involving > 127.0.0.1 (ie, 7F000001 in one byte order or another). Does your > /etc/hosts file map "localhost" to something other than 127.0.0.1? Nope, it maps localhost to 127.0.0.1 > Also, do regular connections to this postmaster work across TCP/IP? > If getaddrinfo() is broken I'd expect there to be problems binding > to the postmaster's listen socket too ... Yes, TCP/IP connections work fine. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
tgl@sss.pgh.pa.us (Tom Lane) writes: > Brad Nicholson <bnichols@ca.afilias.info> writes: >> On Wed, 2006-10-18 at 14:31 -0400, Tom Lane wrote: >>> Would you try strace'ing postmaster start to see what gets passed to the >>> socket() and bind() calls just before this message comes out? > >> Here is the relative output from truss. > >> socket(2, 2, 0) = 7 >> bind(7, 0x00000001100E3BF0, 16) Err#13 EACCES > > Well, that's pretty odd --- I'd have expected to see something involving > 127.0.0.1 (ie, 7F000001 in one byte order or another). Does your > /etc/hosts file map "localhost" to something other than 127.0.0.1? Nope, it maps right. Note that the 0x00000001100E3BF0 value is not an IP address; it is a pointer to a (struct sockaddr *). > Also, do regular connections to this postmaster work across TCP/IP? > If getaddrinfo() is broken I'd expect there to be problems binding > to the postmaster's listen socket too ... There's evidently a difference between TCP and UDP, here. From a gdb run... 259 pgstat.c: No such file or directory. in pgstat.c (gdb) print addrs $7 = (struct addrinfo *) 0x1100e3af0 (gdb) print addrs[0] $8 = {ai_flags = 0, ai_family = 2, ai_socktype = 0, ai_protocol = 0, ai_addrlen = 16, ai_canonname = 0x0, ai_addr = 0x1100e3b30,ai_next = 0x0} (gdb) print addrs[1] $9 = {ai_flags = 1532713819, ai_family = 0, ai_socktype = 0, ai_protocol = 16, ai_addrlen = 1153484460985942017, ai_canonname= 0x0, ai_addr = 0x0, ai_next = 0x0} (gdb) print addrs[2] $10 = {ai_flags = 1, ai_family = 269388064, ai_socktype = 0, ai_protocol = 960, ai_addrlen = 0, ai_canonname = 0x0, ai_addr= 0x0, ai_next = 0x0} (gdb) print addrs[3] $11 = {ai_flags = 0, ai_family = 0, ai_socktype = 0, ai_protocol = 0, ai_addrlen = 0, ai_canonname = 0x0, ai_addr = 0x0,ai_next = 0x0} (gdb) print addr->ai_addrlen $16 = 16 (gdb) print addr->ai_addr->sa_family $17 = 2 '\002' (gdb) print addr->ai_addr->sa_data $18 = "\000\001\177\000\000\001\000\000\000\000\000\000\000" (gdb) print addr->ai_addr $19 = (struct sockaddr *) 0x1100e3b30 (gdb) step 300 in pgstat.c (gdb) step errstart (elevel=15, filename=0x10029f648 "pgstat.c", lineno=302, funcname=0x11000f258 "pgstat_init") at elog.c:149 149 elog.c: No such file or directory. in elog.c -- "cbbrowne","@","acm.org" http://cbbrowne.com/info/emacs.html They have finally found the most ultimately useless thing on the web... Found at the Victoria's Secret website: "The online shop: Text Only Listing"
Brad Nicholson <bnichols@ca.afilias.info> writes: > On Wed, 2006-10-18 at 15:59 -0400, Tom Lane wrote: >> Also, do regular connections to this postmaster work across TCP/IP? >> If getaddrinfo() is broken I'd expect there to be problems binding >> to the postmaster's listen socket too ... > Yes, TCP/IP connections work fine. Hmm, that may shoot holes in this theory, but --- I'm wondering if you've got a mismatch in struct addrinfo declarations. I found a page http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.commtechref/doc/commtrf2/getaddrinfo.htm that says AIX puts ai_canonname before ai_addr in struct addrinfo, which is opposite of what we assume in src/include/getaddrinfo.h. So I guess the next thing to look at is your pg_config.h to see if HAVE_GETADDRINFO and HAVE_STRUCT_ADDRINFO are defined? regards, tom lane
Chris Browne <cbbrowne@acm.org> writes: > (gdb) print addr->ai_addr->sa_data > $18 = "\000\001\177\000\000\001\000\000\000\000\000\000\000" Hmm, that looks a bit odd --- what's the full declaration of structs sockaddr and sockaddr_in on that machine? regards, tom lane
tgl@sss.pgh.pa.us (Tom Lane) writes: > Chris Browne <cbbrowne@acm.org> writes: >> (gdb) print addr->ai_addr->sa_data >> $18 = "\000\001\177\000\000\001\000\000\000\000\000\000\000" > > Hmm, that looks a bit odd --- what's the full declaration of structs > sockaddr and sockaddr_in on that machine? struct sockaddr { uchar_t sa_len; /* total length */ sa_family_t sa_family; /* address family */ char sa_data[14]; /* actually longer; address value */ }; struct sockaddr_in { uchar_t sin_len; sa_family_t sin_family; in_port_t sin_port; struct in_addr sin_addr; uchar_t sin_zero[8]; }; -- output = ("cbbrowne" "@" "cbbrowne.com") http://linuxdatabases.info/info/spreadsheets.html I think Java is the best language going today, which is to say, it's the marginally acceptable one among the set of complete bagbiting loser languages that we have to work with out here in the real world. -- Jamie Zawinski, http://www.jwz.org/doc/java.html
Chris Browne <cbbrowne@acm.org> writes: > tgl@sss.pgh.pa.us (Tom Lane) writes: >> Chris Browne <cbbrowne@acm.org> writes: >>> (gdb) print addr->ai_addr->sa_data >>> $18 = "\000\001\177\000\000\001\000\000\000\000\000\000\000" >> >> Hmm, that looks a bit odd --- what's the full declaration of structs >> sockaddr and sockaddr_in on that machine? > struct sockaddr { > uchar_t sa_len; /* total length */ > sa_family_t sa_family; /* address family */ > char sa_data[14]; /* actually longer; address value */ > }; > struct sockaddr_in { > uchar_t sin_len; > sa_family_t sin_family; > in_port_t sin_port; > struct in_addr sin_addr; > uchar_t sin_zero[8]; > }; So you've got sin_port = 1, which explains the permission-denied message. Next question is why getaddrinfo is setting it that way rather than to zero. Is it possible that getaddrinfo is failing to initialize the field at all, and we're just getting bit by random pre-existing memory contents? regards, tom lane
tgl@sss.pgh.pa.us (Tom Lane) writes: > Brad Nicholson <bnichols@ca.afilias.info> writes: >> On Wed, 2006-10-18 at 15:59 -0400, Tom Lane wrote: >>> Also, do regular connections to this postmaster work across TCP/IP? >>> If getaddrinfo() is broken I'd expect there to be problems binding >>> to the postmaster's listen socket too ... > >> Yes, TCP/IP connections work fine. > > Hmm, that may shoot holes in this theory, but --- I'm wondering if > you've got a mismatch in struct addrinfo declarations. I found a page > http://publib.boulder.ibm.com/infocenter/pseries/v5r3/topic/com.ibm.aix.commtechref/doc/commtrf2/getaddrinfo.htm > that says AIX puts ai_canonname before ai_addr in struct addrinfo, > which is opposite of what we assume in src/include/getaddrinfo.h. > So I guess the next thing to look at is your pg_config.h to see if > HAVE_GETADDRINFO and HAVE_STRUCT_ADDRINFO are defined? Yes, they surely are... pgdba@ydb2.int.libertyrms.com:/opt/rg/data_dba/src/pgsql-8.1.5 $ grep -i addrinfo src/include/pg_config.h /* Define to 1 if you have the `getaddrinfo' function. */ #define HAVE_GETADDRINFO 1 /* Define to 1 if the system has the type `struct addrinfo'. */ #define HAVE_STRUCT_ADDRINFO 1 -- (reverse (concatenate 'string "moc.enworbbc" "@" "enworbbc")) http://cbbrowne.com/info/spreadsheets.html "XFS might (or might not) come out before the year 3000. As far as kernel patches go, SGI are brilliant. As far as graphics, especially OpenGL, go, SGI is untouchable. As far as filing systems go, a concussed doormouse in a tarpit would move faster." -- jd on Slashdot
tgl@sss.pgh.pa.us (Tom Lane) writes: > Chris Browne <cbbrowne@acm.org> writes: >> tgl@sss.pgh.pa.us (Tom Lane) writes: >>> Chris Browne <cbbrowne@acm.org> writes: >>>> (gdb) print addr->ai_addr->sa_data >>>> $18 = "\000\001\177\000\000\001\000\000\000\000\000\000\000" >>> >>> Hmm, that looks a bit odd --- what's the full declaration of structs >>> sockaddr and sockaddr_in on that machine? > >> struct sockaddr { >> uchar_t sa_len; /* total length */ >> sa_family_t sa_family; /* address family */ >> char sa_data[14]; /* actually longer; address value */ >> }; > >> struct sockaddr_in { >> uchar_t sin_len; >> sa_family_t sin_family; >> in_port_t sin_port; >> struct in_addr sin_addr; >> uchar_t sin_zero[8]; >> }; > > So you've got sin_port = 1, which explains the permission-denied > message. Next question is why getaddrinfo is setting it that way > rather than to zero. Is it possible that getaddrinfo is failing > to initialize the field at all, and we're just getting bit by > random pre-existing memory contents? Ok, modifying to force the port value to 0... /* * Create the socket. */ if ((pgStatSock = socket(addr->ai_family, SOCK_DGRAM, 0)) < 0) { ereport(LOG, (errcode_for_socket_access(), errmsg("could not create socket for statistics collector: %m"))); continue; } addr->ai_addr->sa_data[0] = 0; /* cbbrowne: 2006-10-19 */ addr->ai_addr->sa_data[1] = 0; /* cbbrowne: 2006-10-19 */ /* * Bind it to a kernel assigned port on localhost and get the assigned * port via getsockname(). */ if (bind(pgStatSock, addr->ai_addr, addr->ai_addrlen) < 0) { ereport(LOG, (errcode_for_socket_access(), errmsg("could not bind socket for statistics collector: %m"))); closesocket(pgStatSock); pgStatSock = -1; continue; } Bad, old binary: pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ /opt/dbs/pgsql813-plTcl-slony115-AIX53-64bit-2006-02-14/bin/pg_ctl-D . start postmaster starting pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ LOG: could not bind socket for statistics collector: Permission denied LOG: disabling statistics collector for lack of working socket LOG: database system was shut down at 2006-10-19 16:18:13 UTC LOG: checkpoint record is at 0/38D3D0 LOG: redo record is at 0/38D3D0; undo record is at 0/0; shutdown TRUE LOG: next transaction ID: 565; next OID: 10794 LOG: next MultiXactId: 1; next MultiXactOffset: 0 LOG: database system is ready LOG: transaction ID wrap limit is 2147484146, limited by database "postgres" pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ /opt/dbs/pgsql813-plTcl-slony115-AIX53-64bit-2006-02-14/bin/pg_ctl-D . stop waiting for postmaster to shut down....LOG: received smart shutdown request LOG: shutting down LOG: database system is shut down done pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ /opt/dbs/pgsql815-AIX5300-05-02-64bit-TEST_BUILD/bin/pg_ctl -D .start postmaster starting pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ LOG: could not bind IPv4 socket: Address already in use HINT: Is another postmaster already running on port 9732? If not, wait a few seconds and retry. WARNING: could not create listen socket for "127.0.0.1" LOG: database system was shut down at 2006-10-19 16:25:11 UTC LOG: checkpoint record is at 0/38D470 LOG: redo record is at 0/38D470; undo record is at 0/0; shutdown TRUE LOG: next transaction ID: 565; next OID: 10794 LOG: next MultiXactId: 1; next MultiXactOffset: 0 LOG: database system is ready LOG: transaction ID wrap limit is 2147484146, limited by database "postgres" pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ psql -h localhost -p 9732 postgres Welcome to psql 7.4.12, the PostgreSQL interactive terminal. Type: \copyright for distribution terms \h for help with SQL commands \? for help on internal slash commands \g or terminate with semicolon to execute query \q to quit postgres=# begin; BEGIN postgres=# select * from pg_stat_activity ; datid | datname | procpid | usesysid | usename | current_query | query_start | backend_start | client_addr | client_port -------+----------+---------+----------+---------+------------------------------+-------------+-------------------------------+-------------+------------- 10793 | postgres | 352498 | 10 | pgrt | <command string not enabled> | | 2006-10-19 16:25:57.179199+00| 127.0.0.1 | 34178 (1 row) Something's a bit amok with the listen socket issue, but the stats collector is clearly running: pgrt@ydb1.int.libertyrms.com:/opt/rg/data_rt/scratch $ ps auxww | egrep '([p]ostmaster|[p]ostgres:)' pgrt 373030 0.0 0.0 3480 24996 pts/0 A 16:25:35 0:00 /opt/dbs/pgsql815-AIX5300-05-02-64bit-TEST_BUILD/bin/postmaster-D . pgrt 283052 0.0 0.0 3536 25052 pts/0 A 16:25:35 0:00 postgres: writer process pgrt 1081804 0.0 0.0 3756 25268 pts/0 A 16:25:35 0:00 postgres: stats collector process pgrt 1441812 0.0 0.0 3496 25008 pts/0 A 16:25:35 0:00 postgres: stats buffer process Is there a more elegant way of setting the requested port to 0 than my 2-liner? I expect so... -- let name="cbbrowne" and tld="linuxdatabases.info" in String.concat "@" [name;tld];; http://cbbrowne.com/info/ Thank you for onlining with ITS -- Be sure to patronize us again for your next fix.
Chris Browne <cbbrowne@acm.org> writes: > Is there a more elegant way of setting the requested port to 0 than my > 2-liner? What I'm tempted to do is add this to pg_getaddrinfo_all (in src/backend/libpq/ip.c): { /* not all versions of getaddrinfo() zero *result on failure */ *result = NULL; + #ifdef _AIX + /* it seems AIX's getaddrinfo doesn't reliably zero sin_port */ + if (servname == NULL) + servname = "0"; + #endif + #ifdef HAVE_UNIX_SOCKETS if (hintp->ai_family == AF_UNIX) return getaddrinfo_unix(servname, hintp, result); Want to try that? regards, tom lane
tgl@sss.pgh.pa.us (Tom Lane) writes: > Chris Browne <cbbrowne@acm.org> writes: >> Is there a more elegant way of setting the requested port to 0 than my >> 2-liner? > > What I'm tempted to do is add this to pg_getaddrinfo_all (in > src/backend/libpq/ip.c): > > { > /* not all versions of getaddrinfo() zero *result on failure */ > *result = NULL; > > + #ifdef _AIX > + /* it seems AIX's getaddrinfo doesn't reliably zero sin_port */ > + if (servname == NULL) > + servname = "0"; > + #endif > + > #ifdef HAVE_UNIX_SOCKETS > if (hintp->ai_family == AF_UNIX) > return getaddrinfo_unix(servname, hintp, result); > > Want to try that? That seems to work fine, too. And that seems somewhat more portable/elegant/something. -- let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];; http://linuxfinances.info/info/internet.html "Survival in a world of words is best made, if at all, through clever appeal to ambiguity." -- Robert Bolt
Chris Browne <cbbrowne@acm.org> writes: > tgl@sss.pgh.pa.us (Tom Lane) writes: >> What I'm tempted to do is add this to pg_getaddrinfo_all (in >> src/backend/libpq/ip.c): >> ... > That seems to work fine, too. And that seems somewhat more > portable/elegant/something. OK, patch applied as far back as 7.4 --- the 7.3 stats collector isn't affected because it's coded differently. regards, tom lane