Thread: Hung postmaster (8.3.9)

Hung postmaster (8.3.9)

From
"Ed L."
Date:
Need some help.

My PostgreSQL 8.3.6 and now 8.3.9 postmaster is hanging.
No idea why, been running like a top for a year.
Can't do "select version()", even hung after system reboot.
SIGINT/QUIT/TERM have no effect, only SIGKILL can stop it.

This is Linux 2.6.18-92.1.22.el5 SMP x86_64

Here's the log after restarting from SIGKILL:

2010-02-27 20:11:10.426 CST [23134]    LOG:  database system was interrupted; last known up at 2010-02-27 20:08:46 CST
2010-02-27 20:11:10.426 CST [23134]    DEBUG:  checkpoint record is at 8BC/63F5FC0
2010-02-27 20:11:10.426 CST [23134]    DEBUG:  redo record is at 8BC/63F5FC0; shutdown TRUE
2010-02-27 20:11:10.426 CST [23134]    DEBUG:  next transaction ID: 0/488483782; next OID: 1619873999
2010-02-27 20:11:10.426 CST [23134]    DEBUG:  next MultiXactId: 130401682; next MultiXactOffset: 830996786
2010-02-27 20:11:10.426 CST [23134]    LOG:  database system was not properly shut down; automatic recovery in progress
2010-02-27 20:11:10.497 CST [23134]    LOG:  record with zero length at 8BC/63F6020
2010-02-27 20:11:10.497 CST [23134]    LOG:  redo is not required
2010-02-27 20:11:10.503 CST [23134]    LOG:  checkpoint starting: shutdown immediate
2010-02-27 20:11:10.581 CST [23134]    LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
added,0 removed, 0 recycled; write=0.045 s, sync=0.000 s, total=0.083 s 
2010-02-27 20:11:10.587 CST [23134]    DEBUG:  transaction ID wrap limit is 2435542396, limited by database "ohsdba"
2010-02-27 20:11:10.589 CST [23134]    DEBUG:  proc_exit(0)
2010-02-27 20:11:10.589 CST [23134]    DEBUG:  shmem_exit(0)
2010-02-27 20:11:10.592 CST [23134]    DEBUG:  exit(0)

Here's the backtrace:

$ gdb `which postgres`
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(gdb) attach 3776
Attaching to program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 3776
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
(gdb) p debug_query_string
$1 = 0x0
(gdb) bt
#0  0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x000000346f8689cb in getc () from /lib64/libc.so.6
#4  0x0000000000531ee8 in next_token (fp=0x5b90f20, buf=0x7fff59bef330 "", bufsz=4096) at hba.c:128
#5  0x0000000000532233 in tokenize_file (filename=0x5b8f3f0 "global", file=0x5b90f20, lines=0x7fff59bef5c8,
    line_nums=0x7fff59bef5c0) at hba.c:232
#6  0x00000000005322e9 in tokenize_file (filename=0x5b8f3d0 "global/pg_auth", file=0x5b90ce0, lines=0x98b168,
line_nums=0x98b170)
    at hba.c:358
#7  0x00000000005327ff in load_role () at hba.c:959
#8  0x000000000057f300 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2145
#9  <signal handler called>
#10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
#11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
#12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x5b73fe0) at postmaster.c:1031
#13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 3776

Thanks in advance for any help.

Ed


Re: Hung postmaster (8.3.9)

From
Greg Smith
Date:
Ed L. wrote:
> 2010-02-27 20:11:10.426 CST [23134]    LOG:  database system was not properly shut down; automatic recovery in
progress
> 2010-02-27 20:11:10.497 CST [23134]    LOG:  record with zero length at 8BC/63F6020
> 2010-02-27 20:11:10.497 CST [23134]    LOG:  redo is not required
>

If the server is hung at this point, it was doing something:  recovery
from a crash, and during that period it will not respond to statements
until it's finished as you noted.  That can take some time, could run
into the minutes.  And if you kill it, next time the process starts all
over again having made no progress.

I'd try starting the server again, confirm it's only after this point it
pauses at, watching if it's doing something with "top -c", and grab some
backtraces of what it's doing then.  In this sample you gave, the
shutdown was too fast for the server have done very recovery before it
was killed off.  It may just be you need to let it sit there for a while
to finish recovery and then it will be fine again, can't tell from the
data you've provided so far.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Saturday 27 February 2010 @ 22:18, Greg Smith wrote:
> Ed L. wrote:
> > 2010-02-27 20:11:10.426 CST [23134]    LOG:  database system
> > was not properly shut down; automatic recovery in progress
> > 2010-02-27 20:11:10.497 CST [23134]    LOG:  record with
> > zero length at 8BC/63F6020 2010-02-27 20:11:10.497 CST
> > [23134]    LOG:  redo is not required
>
> If the server is hung at this point, it was doing something:
>  recovery from a crash, and during that period it will not
>  respond to statements until it's finished as you noted.  That
>  can take some time, could run into the minutes.  And if you
>  kill it, next time the process starts all over again having
>  made no progress.
>
> I'd try starting the server again, confirm it's only after
>  this point it pauses at, watching if it's doing something
>  with "top -c", and grab some backtraces of what it's doing
>  then.  In this sample you gave, the shutdown was too fast for
>  the server have done very recovery before it was killed off.
>  It may just be you need to let it sit there for a while to
>  finish recovery and then it will be fine again, can't tell
>  from the data you've provided so far.
>

Good point from my example, but it hung like that for 8 hours,
and seems to be ready to do so again.


Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> (gdb) bt
> #0  0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
> #1  0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
> #2  0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
> #3  0x000000346f8689cb in getc () from /lib64/libc.so.6
> #4  0x0000000000531ee8 in next_token (fp=0x5b90f20, buf=0x7fff59bef330 "", bufsz=4096) at hba.c:128
> #5  0x0000000000532233 in tokenize_file (filename=0x5b8f3f0 "global", file=0x5b90f20, lines=0x7fff59bef5c8,
>     line_nums=0x7fff59bef5c0) at hba.c:232
> #6  0x00000000005322e9 in tokenize_file (filename=0x5b8f3d0 "global/pg_auth", file=0x5b90ce0, lines=0x98b168,
line_nums=0x98b170)
>     at hba.c:358
> #7  0x00000000005327ff in load_role () at hba.c:959
> #8  0x000000000057f300 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2145
> #9  <signal handler called>
> #10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
> #11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
> #12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x5b73fe0) at postmaster.c:1031
> #13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188

The postmaster seems to be stuck trying to read $PGDATA/global/pg_auth
(which would be an expected thing for it to do at this point in the
startup sequence).  Does that file exist?  Is it an ordinary file?
Do its contents look sane (a list of your userids and their passwords
and group memberships)?

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 15:41, Ed Loehr (LoehrTech.com) wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > (gdb) bt
> > #0  0x000000346f8c43a0 in __read_nocancel () from
> > /lib64/libc.so.6 #1  0x000000346f86c747 in
> > _IO_new_file_underflow () from /lib64/libc.so.6 #2
> > 0x000000346f86d10e in _IO_default_uflow_internal () from
> > /lib64/libc.so.6 #3  0x000000346f8689cb in getc () from
> > /lib64/libc.so.6 #4  0x0000000000531ee8 in next_token
> > (fp=0x5b90f20, buf=0x7fff59bef330 "", bufsz=4096) at
> > hba.c:128 #5  0x0000000000532233 in tokenize_file
> > (filename=0x5b8f3f0 "global", file=0x5b90f20,
> > lines=0x7fff59bef5c8, line_nums=0x7fff59bef5c0) at hba.c:232
> > #6  0x00000000005322e9 in tokenize_file (filename=0x5b8f3d0
> > "global/pg_auth", file=0x5b90ce0, lines=0x98b168,
> > line_nums=0x98b170) at hba.c:358
> > #7  0x00000000005327ff in load_role () at hba.c:959
> > #8  0x000000000057f300 in reaper (postgres_signal_arg=<value
> > optimized out>) at postmaster.c:2145 #9  <signal handler
> > called>
> > #10 0x000000346f8cb323 in __select_nocancel () from
> > /lib64/libc.so.6 #11 0x000000000057cc33 in ServerLoop () at
> > postmaster.c:1236 #12 0x000000000057dfdf in PostmasterMain
> > (argc=6, argv=0x5b73fe0) at postmaster.c:1031 #13
> > 0x00000000005373de in main (argc=6, argv=<value optimized
> > out>) at main.c:188
>
> The postmaster seems to be stuck trying to read
>  $PGDATA/global/pg_auth (which would be an expected thing for
>  it to do at this point in the startup sequence).  Does that
>  file exist?  Is it an ordinary file? Do its contents look
>  sane (a list of your userids and their passwords and group
>  memberships)?

This just happened again ~24 hours after full reload from backup.
Arrrgh.

Backtrace looks the same again, same file, same
__read_nocancel().  $PGDATA/global/pg_auth looks fine to me,
permissions are 600, entries are 3 or more double-quoted items
per line each separated by a space, items 3 and beyond being
groups.

Any clues?

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 15:46, Ed L. wrote:
> On Monday 01 March 2010 @ 15:41, Ed Loehr (LoehrTech.com) wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > (gdb) bt
> > > #0  0x000000346f8c43a0 in __read_nocancel () from
> > > /lib64/libc.so.6 #1  0x000000346f86c747 in
> > > _IO_new_file_underflow () from /lib64/libc.so.6 #2
> > > 0x000000346f86d10e in _IO_default_uflow_internal () from
> > > /lib64/libc.so.6 #3  0x000000346f8689cb in getc () from
> > > /lib64/libc.so.6 #4  0x0000000000531ee8 in next_token
> > > (fp=0x5b90f20, buf=0x7fff59bef330 "", bufsz=4096) at
> > > hba.c:128 #5  0x0000000000532233 in tokenize_file
> > > (filename=0x5b8f3f0 "global", file=0x5b90f20,
> > > lines=0x7fff59bef5c8, line_nums=0x7fff59bef5c0) at
> > > hba.c:232 #6  0x00000000005322e9 in tokenize_file
> > > (filename=0x5b8f3d0 "global/pg_auth", file=0x5b90ce0,
> > > lines=0x98b168, line_nums=0x98b170) at hba.c:358
> > > #7  0x00000000005327ff in load_role () at hba.c:959
> > > #8  0x000000000057f300 in reaper
> > > (postgres_signal_arg=<value optimized out>) at
> > > postmaster.c:2145 #9  <signal handler called>
> > > #10 0x000000346f8cb323 in __select_nocancel () from
> > > /lib64/libc.so.6 #11 0x000000000057cc33 in ServerLoop ()
> > > at postmaster.c:1236 #12 0x000000000057dfdf in
> > > PostmasterMain (argc=6, argv=0x5b73fe0) at
> > > postmaster.c:1031 #13 0x00000000005373de in main (argc=6,
> > > argv=<value optimized out>) at main.c:188
> >
> > The postmaster seems to be stuck trying to read
> >  $PGDATA/global/pg_auth (which would be an expected thing
> > for it to do at this point in the startup sequence).  Does
> > that file exist?  Is it an ordinary file? Do its contents
> > look sane (a list of your userids and their passwords and
> > group memberships)?
>
> This just happened again ~24 hours after full reload from
>  backup. Arrrgh.
>
> Backtrace looks the same again, same file, same
> __read_nocancel().  $PGDATA/global/pg_auth looks fine to me,
> permissions are 600, entries are 3 or more double-quoted items
> per line each separated by a space, items 3 and beyond being
> groups.
>
> Any clues?

Watching the server logs, the system is continuing to process
data on existing connections.  Just can't get any new ones.
Here's a backtrace for a hung psql -c "select version()":

$ gdb `which psql`
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(gdb) attach 9631
Attaching to program: /opt/pgsql/installs/postgresql-8.3.9/bin/psql, process 9631
Reading symbols from /opt/pgsql/installs/postgresql-8.3.9/lib/libpq.so.5...done.
Loaded symbols for /opt/pgsql/installs/postgresql-8.3.9/lib/libpq.so.5
Reading symbols from /usr/lib64/libz.so.1...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /usr/lib64/libreadline.so.5...done.
Loaded symbols for /usr/lib64/libreadline.so.5
Reading symbols from /lib64/libtermcap.so.2...done.
Loaded symbols for /lib64/libtermcap.so.2
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
0x000000346f8c92af in poll () from /lib64/libc.so.6
(gdb) bt
#0  0x000000346f8c92af in poll () from /lib64/libc.so.6
#1  0x00002b03826e5e6f in pqSocketCheck (conn=0x655eef0, forRead=1, forWrite=0, end_time=-1) at fe-misc.c:1046
#2  0x00002b03826e5f10 in pqWaitTimed (forRead=1, forWrite=-1, conn=0x655eef0, finish_time=-1) at fe-misc.c:920
#3  0x00002b03826e1752 in connectDBComplete (conn=0x655eef0) at fe-connect.c:930
#4  0x00002b03826e2c60 in PQsetdbLogin (pghost=0x0, pgport=0x0, pgoptions=0x0, pgtty=0x0, dbName=0x0, login=0x0,
pwd=0x0)at fe-connect.c:678 
#5  0x000000000040e319 in main (argc=<value optimized out>, argv=0x7fff283ce6e8) at startup.c:195


Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 15:59, Ed L. wrote:
> > This just happened again ~24 hours after full reload from
> >  backup. Arrrgh.
> >
> > Backtrace looks the same again, same file, same
> > __read_nocancel().  $PGDATA/global/pg_auth looks fine to me,
> > permissions are 600, entries are 3 or more double-quoted
> > items per line each separated by a space, items 3 and beyond
> > being groups.
> >
> > Any clues?

Also seeing lots of postmaster zombies (190 and growing)...

Ed
--


Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 16:03, Ed L. wrote:
> On Monday 01 March 2010 @ 15:59, Ed L. wrote:
> > > This just happened again ~24 hours after full reload from
> > >  backup. Arrrgh.
> > >
> > > Backtrace looks the same again, same file, same
> > > __read_nocancel().  $PGDATA/global/pg_auth looks fine to
> > > me, permissions are 600, entries are 3 or more
> > > double-quoted items per line each separated by a space,
> > > items 3 and beyond being groups.
> > >
> > > Any clues?
>
> Also seeing lots of postmaster zombies (190 and growing)...

While new connections are hanging, top shows postmaster using
100% of cpu.  SIGTERM/SIGQUIT do nothing.  Here's a backtrace
of this busy postmaster:

(gdb) bt
#0  0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x000000346f8689cb in getc () from /lib64/libc.so.6
#4  0x0000000000531ee8 in next_token (fp=0x10377ae0, buf=0x7fff32230e60 "", bufsz=4096) at hba.c:128
#5  0x0000000000532233 in tokenize_file (filename=0x10359b70 "global", file=0x10377ae0, lines=0x7fff322310f8,
line_nums=0x7fff322310f0)at hba.c:232 
#6  0x00000000005322e9 in tokenize_file (filename=0x2b1c8cbf5800 "global/pg_auth", file=0x103767a0, lines=0x98b168,
line_nums=0x98b170)at hba.c:358 
#7  0x00000000005327ff in load_role () at hba.c:959
#8  0x000000000057f878 in sigusr1_handler (postgres_signal_arg=<value optimized out>) at postmaster.c:3830
#9  <signal handler called>
#10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
#11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
#12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x1033f000) at postmaster.c:1031
#13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188

...and more from the server logs, fwiw:

2010-03-01 17:30:24.213 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:30:31.250 CST [32236]    DEBUG:  transaction log switch forced (archive_timeout=300)
2010-03-01 17:31:24.216 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:32:24.219 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:33:24.222 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:34:24.225 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:35:19.061 CST [32236]    LOG:  checkpoint starting: time
2010-03-01 17:35:19.185 CST [32236]    DEBUG:  recycled transaction log file "000000010000001C00000071"
2010-03-01 17:35:19.185 CST [32236]    LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
added,0 removed, 1 recycled;  
write=0.028 s, sync=0.000 s, total=0.124 s
2010-03-01 17:35:24.328 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:35:31.224 CST [32236]    DEBUG:  transaction log switch forced (archive_timeout=300)
2010-03-01 17:36:44.332 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:37:44.434 CST [32238]    WARNING:  worker took too long to start; cancelled
2010-03-01 17:37:47.378 CST [3692] dba 10....(42816) dba LOG:  could not receive data from client: Connection timed out
2010-03-01 17:37:47.378 CST [3692] dba 10....(42816) dba LOG:  unexpected EOF on client connection
2010-03-01 17:37:47.380 CST [3692] dba 10....(42816) dba LOG:  disconnection: session time: 2:11:15.303 user=dba
database=dbahost=... port=428 

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> Also seeing lots of postmaster zombies (190 and growing)...

Yeah, that and the lack of service for new connections would both be
expected if the postmaster is stuck.  And the autovac worker start
failures, too.  There's only one bug here.

> While new connections are hanging, top shows postmaster using
> 100% of cpu.

Oh, for some reason I thought it was sitting idle.  That sounds more
like an infinite loop.  Try reattaching to the postmaster, confirm the
stack trace, and then see how many times you can do "fin" before it
doesn't return control.  That will tell us which level of subroutine is
looping.

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> #4  0x0000000000531ee8 in next_token (fp=0x10377ae0, buf=0x7fff32230e60 "", bufsz=4096) at hba.c:128
> #5  0x0000000000532233 in tokenize_file (filename=0x10359b70 "global", file=0x10377ae0, lines=0x7fff322310f8,
line_nums=0x7fff322310f0)at hba.c:232 
> #6  0x00000000005322e9 in tokenize_file (filename=0x2b1c8cbf5800 "global/pg_auth", file=0x103767a0, lines=0x98b168,
line_nums=0x98b170)at hba.c:358 
> #7  0x00000000005327ff in load_role () at hba.c:959

Now that I look more closely at those line numbers, it looks like the
thing thinks it is processing an include file.  Are there any @ signs
in your global/pg_auth file?

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 16:57, Tom Lane wrote:>
> Now that I look more closely at those line numbers, it looks
>  like the thing thinks it is processing an include file.  Are
>  there any @ signs in your global/pg_auth file?

Yes, indeed, there are many.  My user names are "user@host" form,
and have been for years.  Would that be a problem now?

Ed

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Monday 01 March 2010 @ 16:57, Tom Lane wrote:>
>> Now that I look more closely at those line numbers, it looks
>> like the thing thinks it is processing an include file.  Are
>> there any @ signs in your global/pg_auth file?

> Yes, indeed, there are many.  My user names are "user@host" form,
> and have been for years.  Would that be a problem now?

user@host shouldn't be a problem, but if there were an @ by itself or
starting a token, it might possibly cause something like this.

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:15, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Monday 01 March 2010 @ 16:57, Tom Lane wrote:>
> >
> >> Now that I look more closely at those line numbers, it
> >> looks like the thing thinks it is processing an include
> >> file.  Are there any @ signs in your global/pg_auth file?
> >
> > Yes, indeed, there are many.  My user names are "user@host"
> > form, and have been for years.  Would that be a problem now?
>
> user@host shouldn't be a problem, but if there were an @ by
>  itself or starting a token, it might possibly cause something
>  like this.

There is one, looks like a typo got in.  How do I fix it?

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Monday 01 March 2010 @ 17:15, Tom Lane wrote:
>> user@host shouldn't be a problem, but if there were an @ by
>> itself or starting a token, it might possibly cause something
>> like this.

> There is one, looks like a typo got in.  How do I fix it?

Hmm, a user named @, or what?

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 16:49, Tom Lane wrote:
>
> Oh, for some reason I thought it was sitting idle.  That
>  sounds more like an infinite loop.  Try reattaching to the
>  postmaster, confirm the stack trace, and then see how many
>  times you can do "fin" before it doesn't return control.
>  That will tell us which level of subroutine is looping.

Yeah, it took me a bit to notice it was at 100% cpu.  Here's
gdb output with the 'fin's you suggested:

$ gdb `which postgres`
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(gdb) attach 21670
Attaching to program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 21670
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
(gdb) bt
#0  0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x000000346f8689cb in getc () from /lib64/libc.so.6
#4  0x0000000000531ee8 in next_token (fp=0x11da7f30, buf=0x7fff5cb9e300 "", bufsz=4096) at hba.c:128
#5  0x0000000000532233 in tokenize_file (filename=0x11da5940 "global", file=0x11da7f30, lines=0x7fff5cb9e598,
line_nums=0x7fff5cb9e590)at hba.c:232 
#6  0x00000000005322e9 in tokenize_file (filename=0x11da5920 "global/pg_auth", file=0x11da7cf0, lines=0x98b168,
line_nums=0x98b170)at hba.c:358 
#7  0x00000000005327ff in load_role () at hba.c:959
#8  0x000000000057f300 in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2145
#9  <signal handler called>
#10 0x000000346f8cb323 in __select_nocancel () from /lib64/libc.so.6
#11 0x000000000057cc33 in ServerLoop () at postmaster.c:1236
#12 0x000000000057dfdf in PostmasterMain (argc=6, argv=0x11d8afb0) at postmaster.c:1031
#13 0x00000000005373de in main (argc=6, argv=<value optimized out>) at main.c:188
(gdb) fin
Run till exit from #0  0x000000346f8c43a0 in __read_nocancel () from /lib64/libc.so.6
0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
(gdb) fin
Run till exit from #0  0x000000346f86c747 in _IO_new_file_underflow () from /lib64/libc.so.6
0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
(gdb) fin
Run till exit from #0  0x000000346f86d10e in _IO_default_uflow_internal () from /lib64/libc.so.6
0x000000346f8689cb in getc () from /lib64/libc.so.6
(gdb) fin
Run till exit from #0  0x000000346f8689cb in getc () from /lib64/libc.so.6
0x0000000000531ee8 in next_token (fp=0x11da7f30, buf=0x7fff5cb9e300 "", bufsz=<value optimized out>) at hba.c:128
128        while ((c = getc(fp)) != EOF && (pg_isblank(c) || c == ','))
(gdb) fin
Run till exit from #0  0x0000000000531ee8 in next_token (fp=0x11da7f30, buf=0x7fff5cb9e300 "", bufsz=<value optimized
out>)at hba.c:128 
0x0000000000532233 in tokenize_file (filename=0x11da5940 "global", file=0x11da7f30, lines=0x7fff5cb9e598,
line_nums=0x7fff5cb9e590)at hba.c:232 
232            if (!next_token(file, buf, sizeof(buf)))
Value returned is $1 = 0 '\0'
(gdb) fin
Run till exit from #0  0x0000000000532233 in tokenize_file (filename=0x11da5940 "global", file=0x11da7f30,
lines=0x7fff5cb9e598, 
line_nums=0x7fff5cb9e590) at hba.c:232



Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:18, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Monday 01 March 2010 @ 17:15, Tom Lane wrote:
> >> user@host shouldn't be a problem, but if there were an @ by
> >> itself or starting a token, it might possibly cause
> >> something like this.
> >
> > There is one, looks like a typo got in.  How do I fix it?
>
> Hmm, a user named @, or what?

Yes, a bogus user:

"@" "" ""

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:23, Ed L. wrote:
> On Monday 01 March 2010 @ 17:18, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > On Monday 01 March 2010 @ 17:15, Tom Lane wrote:
> > >> user@host shouldn't be a problem, but if there were an @
> > >> by itself or starting a token, it might possibly cause
> > >> something like this.
> > >
> > > There is one, looks like a typo got in.  How do I fix it?
> >
> > Hmm, a user named @, or what?
>
> Yes, a bogus user:
>
> "@" "" ""
>

Correction.  Here's the line:

"@" "" "" "agent_group"


Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Monday 01 March 2010 @ 17:18, Tom Lane wrote:
>> "Ed L." <pgsql@bluepolka.net> writes:
>>> There is one, looks like a typo got in.  How do I fix it?
>>
>> Hmm, a user named @, or what?

> Yes, a bogus user:

> "@" "" ""

Mph.  We really ought to fix things so that a quoted @ doesn't get taken
as an include file reference.  Odd that it's never come up before.

Anyway, if you still have any open superuser sessions, the best thing
would be an ALTER USER RENAME.  If you don't, you'll have to resort to
manually editing the pg_auth file, and then rename the user as soon as
you can get in.

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:25, Ed L. wrote:
> On Monday 01 March 2010 @ 17:23, Ed L. wrote:
> > On Monday 01 March 2010 @ 17:18, Tom Lane wrote:
> > > "Ed L." <pgsql@bluepolka.net> writes:
> > > > On Monday 01 March 2010 @ 17:15, Tom Lane wrote:
> > > >> user@host shouldn't be a problem, but if there were an
> > > >> @ by itself or starting a token, it might possibly
> > > >> cause something like this.
> > > >
> > > > There is one, looks like a typo got in.  How do I fix
> > > > it?
> > >
> > > Hmm, a user named @, or what?
> >
> > Yes, a bogus user:
> >
> > "@" "" ""
>
> Correction.  Here's the line:
>
> "@" "" "" "agent_group"

It is the first line in the pg_auth file.

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:26, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > On Monday 01 March 2010 @ 17:18, Tom Lane wrote:
> >> "Ed L." <pgsql@bluepolka.net> writes:
> >>> There is one, looks like a typo got in.  How do I fix it?
> >>
> >> Hmm, a user named @, or what?
> >
> > Yes, a bogus user:
> >
> > "@" "" ""
>
> Mph.  We really ought to fix things so that a quoted @ doesn't
>  get taken as an include file reference.  Odd that it's never
>  come up before.
>
> Anyway, if you still have any open superuser sessions, the
>  best thing would be an ALTER USER RENAME.  If you don't,
>  you'll have to resort to manually editing the pg_auth file,
>  and then rename the user as soon as you can get in.

Killed the stuck postmaster with sigkill, edited the file,
restarted postmaster, and it re-wrote the file with the bogus
entry.  I don't have any superuser sessions open.  Is there
another route?

Ed

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:36, Ed L. wrote:
> On Monday 01 March 2010 @ 17:26, Tom Lane wrote:
> > "Ed L." <pgsql@bluepolka.net> writes:
> > > On Monday 01 March 2010 @ 17:18, Tom Lane wrote:
> > >> "Ed L." <pgsql@bluepolka.net> writes:
> > >>> There is one, looks like a typo got in.  How do I fix
> > >>> it?
> > >>
> > >> Hmm, a user named @, or what?
> > >
> > > Yes, a bogus user:
> > >
> > > "@" "" ""
> >
> > Mph.  We really ought to fix things so that a quoted @
> > doesn't get taken as an include file reference.  Odd that
> > it's never come up before.
> >
> > Anyway, if you still have any open superuser sessions, the
> >  best thing would be an ALTER USER RENAME.  If you don't,
> >  you'll have to resort to manually editing the pg_auth file,
> >  and then rename the user as soon as you can get in.
>
> Killed the stuck postmaster with sigkill, edited the file,
> restarted postmaster, and it re-wrote the file with the bogus
> entry.  I don't have any superuser sessions open.  Is there
> another route?

I have source code and can patch and rebuild/reinstall if I had
the right patch.

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
>> Correction.  Here's the line:
>> "@" "" "" "agent_group"

> It is the first line in the pg_auth file.

BTW, there seems to be some other contributing factor here besides
the weird username, because I don't see any looping when I try
CREATE USER "@".  What's your platform exactly, and what type of
filesystem is $PGDATA on?

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> Killed the stuck postmaster with sigkill, edited the file,
> restarted postmaster, and it re-wrote the file with the bogus
> entry.  I don't have any superuser sessions open.  Is there
> another route?

What you're going to need to do is stop the postmaster, start
a standalone backend (see the "postgres" man page entry if you
never did that before) and issue the ALTER USER in the standalone
backend.  Then you can restart normal operations.

            regards, tom lane

[SOLVED] Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:58, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > Killed the stuck postmaster with sigkill, edited the file,
> > restarted postmaster, and it re-wrote the file with the
> > bogus entry.  I don't have any superuser sessions open.  Is
> > there another route?
>
> What you're going to need to do is stop the postmaster, start
> a standalone backend (see the "postgres" man page entry if you
> never did that before) and issue the ALTER USER in the
>  standalone backend.  Then you can restart normal operations.

That did the trick.  Thank you very much, Sensei.

Ed

Re: [SOLVED] Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> That did the trick.  Thank you very much, Sensei.

I'd still like to know about platform etc.  I see that we shouldn't be
allowing a username to trigger @-file expansion, but even with that
it's not clear how it turned into an infinite loop; and as I said,
I can't reproduce that here.

            regards, tom lane

Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 17:57, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> >> Correction.  Here's the line:
> >> "@" "" "" "agent_group"
> >
> > It is the first line in the pg_auth file.
>
> BTW, there seems to be some other contributing factor here
>  besides the weird username, because I don't see any looping
>  when I try CREATE USER "@".  What's your platform exactly,
>  and what type of filesystem is $PGDATA on?

This is CentOS 5.2, Linux 2.6.18-92.1.22.el5 #1 SMP x86_64
GNU/Linux.  I think the PGDATA filesystem is LVM running on a
RAID 5 setup, but not 100% sure.

# df
Filesystem           1K-blocks      Used Available Use% Mounted
on
/dev/mapper/VolGroup00-LogVol00
                     267422192 132412984 121205688  53% /
...

Re: [SOLVED] Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 18:29, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > That did the trick.  Thank you very much, Sensei.
>
> I'd still like to know about platform etc.  I see that we
>  shouldn't be allowing a username to trigger @-file expansion,
>  but even with that it's not clear how it turned into an
>  infinite loop; and as I said, I can't reproduce that here.

The original problem occurred on that CentOS box, but I just
reproduced it on a new cluster on Ubuntu running LVM but no
RAID.  Here's the transcript:

ed@duke:~/test$ uname -a
Linux duke 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux

ed@duke:~/test$ export PGPORT=9000
ed@duke:~/test$ w^C
ed@duke:~/test$ svi
ed@duke:~/test$ export PGDATA=`pwd`/data
ed@duke:~/test$ initdb --locale=C -D $PGDATA
The files belonging to this database system will be owned by user "ed".
This user must also own the server process.

The database cluster will be initialized with locale C.
The default database encoding has accordingly been set to SQL_ASCII.
The default text search configuration will be set to "english".

creating directory /home/ed/test/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers/max_fsm_pages ... 32MB/204800
creating configuration files ... ok
creating template1 database in /home/ed/test/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.

Success. You can now start the database server using:

    postgres -D /home/ed/test/data
or
    pg_ctl -D /home/ed/test/data -l logfile start

ed@duke:~/test$ pg_ctl -D /home/ed/test/data -l logfile start
server starting
ed@duke:~/test$ psql -c "select version()"
psql: FATAL:  database "ed" does not exist
ed@duke:~/test$ create^C
ed@duke:~/test$ createdb
ed@duke:~/test$ psql -c "select version()"
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.9 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
(1 row)

ed@duke:~/test$ createuser "@" with login
createuser: too many command-line arguments (first is "with")
Try "createuser --help" for more information.
ed@duke:~/test$ psql -c "create user \"@\" with login"
CREATE ROLE
ed@duke:~/test$ cat data/global/
1136         1214         1260         2396         2672         2694         2698         2844         2847
pg_database  
1137         1232         1261         2397         2676         2695         2842         2845         pg_auth
pgstat.stat  
1213         1233         1262         2671         2677         2697         2843         2846         pg_control
ed@duke:~/test$ cat data/global/pg_auth
"@" "" ""
"ed" "" ""
ed@duke:~/test$ pg_ctl -D $PGDATA stop
waiting for server to shut down...........................................^C
ed@duke:~/test$ ^C
ed@duke:~/test$ ^C
ed@duke:~/test$ ^C
ed@duke:~/test$ pg_ctl -D $PGDATA -m fast stop
waiting for server to shut down.............^C
ed@duke:~/test$ ^C
ed@duke:~/test$ ps augxwwwf | grep postma
ed        8419  0.0  0.0   7336   868 pts/4    S+   18:36   0:00      \_ grep postma
ohsdba    6115  0.0  0.3 503692 28028 pts/0    S    18:03   0:00 postmaster -D
/users/ohsdba/dbclusters/ohs/postgresql-8.3.x/data-i -p 9001 
ed@duke:~/test$ gdb `which postgre^C
ed@duke:~/test$ pg_ctl -D /home/ed/test/data -l logfile start
pg_ctl: another server might be running; trying to start server anyway
pg_ctl: could not start server
Examine the log output.
ed@duke:~/test$ cat logfile
LOG:  database system was shut down at 2010-03-01 18:34:25 MST
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
FATAL:  database "ed" does not exist
WARNING:  worker took too long to start; cancelled
FATAL:  lock file "postmaster.pid" already exists
HINT:  Is another postmaster (PID 8203) running in data directory "/home/ed/test/data"?
ed@duke:~/test$ ps augxwwwf | grep 8203
ed        8465  0.0  0.0   7336   868 pts/4    S+   18:37   0:00      \_ grep 8203
ed        8203 73.7  0.0  63080  4248 pts/4    R    18:34   2:07 /opt/pgsql/installs/postgresql-8.3.9/bin/postgres -D
/home/ed/test/data
ed@duke:~/test$ gdb `which postgres`
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/pgsql/installs/postgresql-8.3.9/bin/postgres...done.
(gdb) attach 8203
Attaching to program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 8203
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
0x00007f3c58f9ea30 in read () from /lib/libc.so.6
(gdb) bt
#0  0x00007f3c58f9ea30 in read () from /lib/libc.so.6
#1  0x00007f3c58f3ead8 in _IO_file_underflow () from /lib/libc.so.6
#2  0x00007f3c58f4065e in _IO_default_uflow () from /lib/libc.so.6
#3  0x00007f3c58f3741b in getc () from /lib/libc.so.6
#4  0x00000000005460d8 in T.136 ()
#5  0x00000000005466e4 in tokenize_file ()
#6  0x00000000005467ed in tokenize_file ()
#7  0x0000000000546cef in load_role ()
#8  0x00000000005937ed in sigusr1_handler ()
#9  <signal handler called>
#10 0x00007f3c58fa5373 in select () from /lib/libc.so.6
#11 0x0000000000591cc2 in ServerLoop ()
#12 0x0000000000594467 in PostmasterMain ()
#13 0x000000000054a9b0 in main ()
(gdb) fin
Run till exit from #0  0x00007f3c58f9ea30 in read () from /lib/libc.so.6
0x00007f3c58f3ead8 in _IO_file_underflow () from /lib/libc.so.6
(gdb) fin
Run till exit from #0  0x00007f3c58f3ead8 in _IO_file_underflow () from /lib/libc.so.6
0x00007f3c58f4065e in _IO_default_uflow () from /lib/libc.so.6
(gdb) fin
Run till exit from #0  0x00007f3c58f4065e in _IO_default_uflow () from /lib/libc.so.6
0x00007f3c58f3741b in getc () from /lib/libc.so.6
(gdb) fin
Run till exit from #0  0x00007f3c58f3741b in getc () from /lib/libc.so.6
0x00000000005460d8 in T.136 ()
(gdb) fin
Run till exit from #0  0x00000000005460d8 in T.136 ()
0x00000000005466e4 in tokenize_file ()
(gdb) fin
Run till exit from #0  0x00000000005466e4 in tokenize_file ()

(Infinite loop here...)

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb)
The program is not running.
(gdb)
The program is not running.
(gdb) quit
ed@duke:~/test$ uname -a
Linux duke 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux
ed@duke:~/test$ uname -a
Linux duke 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux
ed@duke:~/test$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/duke-root
                     468727296 229717260 215199988  52% /
udev                   4096736       320   4096416   1% /dev
none                   4096736      1280   4095456   1% /dev/shm
none                   4096736       156   4096580   1% /var/run
none                   4096736         0   4096736   0% /var/lock
none                   4096736         0   4096736   0% /lib/init/rw
/dev/sda5               233335     47309    173578  22% /boot
/dev/sr0                   388       388         0 100% /media/cdrw

Re: Hung postmaster (8.3.9)

From
Tom Lane
Date:
"Ed L." <pgsql@bluepolka.net> writes:
> On Monday 01 March 2010 @ 17:57, Tom Lane wrote:
>> BTW, there seems to be some other contributing factor here
>> besides the weird username, because I don't see any looping
>> when I try CREATE USER "@".  What's your platform exactly,
>> and what type of filesystem is $PGDATA on?

> This is CentOS 5.2, Linux 2.6.18-92.1.22.el5 #1 SMP x86_64
> GNU/Linux.

Ah.  I can reproduce it on my Fedora box.  The infinite loop is because
feof never returns 1 when reading from a directory.  I think this is a
glibc bug and have filed it accordingly:
https://bugzilla.redhat.com/show_bug.cgi?id=569697
but IME the glibc boys can be pretty stubborn about acknowledging that
corner cases in their code are actually bugs.  We shall see.

In the meantime, it seems like we ought to take two defensive steps:
prevent a quoted @ from being considered as an include introducer,
and prevent @ with no additional text from being considered as an
inclusion reference no matter what.  What the current code is doing
is seeing "@" as an include file reference with empty include name,
and by the time canonicalize_file is done with it this ends up as
a reference to the $PGDATA/global directory itself.  Which Fedora
allows us to open and read, but it reads as an infinite sequence
of EOF characters because feof never succeeds.

            regards, tom lane

Re: [SOLVED] Re: Hung postmaster (8.3.9)

From
"Ed L."
Date:
On Monday 01 March 2010 @ 18:29, Tom Lane wrote:
> "Ed L." <pgsql@bluepolka.net> writes:
> > That did the trick.  Thank you very much, Sensei.
>
> I'd still like to know about platform etc.  I see that we
>  shouldn't be allowing a username to trigger @-file expansion,
>  but even with that it's not clear how it turned into an
>  infinite loop; and as I said, I can't reproduce that here.

The original problem occurred on that CentOS box, but I just
reproduced it on a new cluster on Ubuntu running LVM but no
RAID.  Here's the transcript:

ed@duke:~/test$ uname -a
Linux duke 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux

ed@duke:~/test$ export PGPORT=9000
ed@duke:~/test$ w^C
ed@duke:~/test$ svi
ed@duke:~/test$ export PGDATA=`pwd`/data
ed@duke:~/test$ initdb --locale=C -D $PGDATA
The files belonging to this database system will be owned by user "ed".
This user must also own the server process.

The database cluster will be initialized with locale C.
The default database encoding has accordingly been set to SQL_ASCII.
The default text search configuration will be set to "english".

creating directory /home/ed/test/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers/max_fsm_pages ... 32MB/204800
creating configuration files ... ok
creating template1 database in /home/ed/test/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.

Success. You can now start the database server using:

    postgres -D /home/ed/test/data
or
    pg_ctl -D /home/ed/test/data -l logfile start

ed@duke:~/test$ pg_ctl -D /home/ed/test/data -l logfile start
server starting
ed@duke:~/test$ psql -c "select version()"
psql: FATAL:  database "ed" does not exist
ed@duke:~/test$ create^C
ed@duke:~/test$ createdb
ed@duke:~/test$ psql -c "select version()"
                                             version
-------------------------------------------------------------------------------------------------
 PostgreSQL 8.3.9 on x86_64-unknown-linux-gnu, compiled by GCC gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
(1 row)

ed@duke:~/test$ createuser "@" with login
createuser: too many command-line arguments (first is "with")
Try "createuser --help" for more information.
ed@duke:~/test$ psql -c "create user \"@\" with login"
CREATE ROLE
ed@duke:~/test$ cat data/global/
1136         1214         1260         2396         2672         2694         2698         2844         2847
pg_database  
1137         1232         1261         2397         2676         2695         2842         2845         pg_auth
pgstat.stat  
1213         1233         1262         2671         2677         2697         2843         2846         pg_control
ed@duke:~/test$ cat data/global/pg_auth
"@" "" ""
"ed" "" ""
ed@duke:~/test$ pg_ctl -D $PGDATA stop
waiting for server to shut down...........................................^C
ed@duke:~/test$ ^C
ed@duke:~/test$ ^C
ed@duke:~/test$ ^C
ed@duke:~/test$ pg_ctl -D $PGDATA -m fast stop
waiting for server to shut down.............^C
ed@duke:~/test$ ^C
ed@duke:~/test$ ps augxwwwf | grep postma
ed        8419  0.0  0.0   7336   868 pts/4    S+   18:36   0:00      \_ grep postma
ohsdba    6115  0.0  0.3 503692 28028 pts/0    S    18:03   0:00 postmaster -D
/users/ohsdba/dbclusters/ohs/postgresql-8.3.x/data-i -p 9001 
ed@duke:~/test$ gdb `which postgre^C
ed@duke:~/test$ pg_ctl -D /home/ed/test/data -l logfile start
pg_ctl: another server might be running; trying to start server anyway
pg_ctl: could not start server
Examine the log output.
ed@duke:~/test$ cat logfile
LOG:  database system was shut down at 2010-03-01 18:34:25 MST
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
FATAL:  database "ed" does not exist
WARNING:  worker took too long to start; cancelled
FATAL:  lock file "postmaster.pid" already exists
HINT:  Is another postmaster (PID 8203) running in data directory "/home/ed/test/data"?
ed@duke:~/test$ ps augxwwwf | grep 8203
ed        8465  0.0  0.0   7336   868 pts/4    S+   18:37   0:00      \_ grep 8203
ed        8203 73.7  0.0  63080  4248 pts/4    R    18:34   2:07 /opt/pgsql/installs/postgresql-8.3.9/bin/postgres -D
/home/ed/test/data
ed@duke:~/test$ gdb `which postgres`
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/pgsql/installs/postgresql-8.3.9/bin/postgres...done.
(gdb) attach 8203
Attaching to program: /opt/pgsql/installs/postgresql-8.3.9/bin/postgres, process 8203
Reading symbols from /lib/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libcrypt.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
0x00007f3c58f9ea30 in read () from /lib/libc.so.6
(gdb) bt
#0  0x00007f3c58f9ea30 in read () from /lib/libc.so.6
#1  0x00007f3c58f3ead8 in _IO_file_underflow () from /lib/libc.so.6
#2  0x00007f3c58f4065e in _IO_default_uflow () from /lib/libc.so.6
#3  0x00007f3c58f3741b in getc () from /lib/libc.so.6
#4  0x00000000005460d8 in T.136 ()
#5  0x00000000005466e4 in tokenize_file ()
#6  0x00000000005467ed in tokenize_file ()
#7  0x0000000000546cef in load_role ()
#8  0x00000000005937ed in sigusr1_handler ()
#9  <signal handler called>
#10 0x00007f3c58fa5373 in select () from /lib/libc.so.6
#11 0x0000000000591cc2 in ServerLoop ()
#12 0x0000000000594467 in PostmasterMain ()
#13 0x000000000054a9b0 in main ()
(gdb) fin
Run till exit from #0  0x00007f3c58f9ea30 in read () from /lib/libc.so.6
0x00007f3c58f3ead8 in _IO_file_underflow () from /lib/libc.so.6
(gdb) fin
Run till exit from #0  0x00007f3c58f3ead8 in _IO_file_underflow () from /lib/libc.so.6
0x00007f3c58f4065e in _IO_default_uflow () from /lib/libc.so.6
(gdb) fin
Run till exit from #0  0x00007f3c58f4065e in _IO_default_uflow () from /lib/libc.so.6
0x00007f3c58f3741b in getc () from /lib/libc.so.6
(gdb) fin
Run till exit from #0  0x00007f3c58f3741b in getc () from /lib/libc.so.6
0x00000000005460d8 in T.136 ()
(gdb) fin
Run till exit from #0  0x00000000005460d8 in T.136 ()
0x00000000005466e4 in tokenize_file ()
(gdb) fin
Run till exit from #0  0x00000000005466e4 in tokenize_file ()

(Infinite loop here...)

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb)
The program is not running.
(gdb)
The program is not running.
(gdb) quit
ed@duke:~/test$ uname -a
Linux duke 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux
ed@duke:~/test$ uname -a
Linux duke 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:53:52 UTC 2009 x86_64 GNU/Linux
ed@duke:~/test$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/duke-root
                     468727296 229717260 215199988  52% /
udev                   4096736       320   4096416   1% /dev
none                   4096736      1280   4095456   1% /dev/shm
none                   4096736       156   4096580   1% /var/run
none                   4096736         0   4096736   0% /var/lock
none                   4096736         0   4096736   0% /lib/init/rw
/dev/sda5               233335     47309    173578  22% /boot
/dev/sr0                   388       388         0 100% /media/cdrw

Re: Hung postmaster (8.3.9)

From
Greg Stark
Date:
We should probably also check and prohibit including directories as files.

On Tuesday, March 2, 2010, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> In the meantime, it seems like we ought to take two defensive steps:


--
greg