Thread: BUG #3902: Segmentation faults using GSSAPI authentication

BUG #3902: Segmentation faults using GSSAPI authentication

From
"Peter Koczan"
Date:
The following bug has been logged online:

Bug reference:      3902
Logged by:          Peter Koczan
Email address:      pjkoczan@gmail.com
PostgreSQL version: 8.3RC2
Operating system:   Red Hat Enterprise Linux 5
Description:        Segmentation faults using GSSAPI authentication
Details:

Trying to connect via GSSAPI (with Kerberos 5 as the underlying mechanism)
after the server has been under a very slight load results in an unusable
but still running server.

I was able to produce this problem by trying to connect to the database, run
a few commands, disconnect, and trying to reconnect. I was also able to
produce this when trying to connect via GSSAPI to a server when it was
restoring a database (using local trust authentication).

The syslog output shows many of these errors when trying to connect from one
such server process:

Jan 24 15:46:17 mitchell postgres[14663]: [1-1] LOG:  could not load root
certificate file "root.crt": No such file or directory
Jan 24 15:46:17 mitchell postgres[14663]: [1-2] DETAIL:  Will not verify
client certificates.
Jan 24 15:46:17 mitchell postgres[14663]: [2-1] LOG:  database system is
ready to accept connections
Jan 24 16:16:18 mitchell postgres[14663]: [3-1] LOG:  received SIGHUP,
reloading configuration files
Jan 24 16:18:31 mitchell postgres[14663]: [4-1] LOG:  server process (PID
15004) was terminated by signal 11: Segmentation fault
Jan 24 16:18:31 mitchell postgres[14663]: [5-1] LOG:  terminating any other
active server processes
Jan 24 16:18:31 mitchell postgres[14663]: [6-1] LOG:  all server processes
terminated; reinitializing
Jan 24 16:18:31 mitchell postgres[14663]: [7-1] LOG:  database system is
ready to accept connections
Jan 24 16:48:49 mitchell postgres[14663]: [8-1] LOG:  received smart
shutdown request

The client connection shows this:
$ /s/postgresql-8.3-RC2/bin/psql -h mitchell -p 5434
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Syslog for a connection shows this:
Jan 24 16:16:59 mitchell postgres[15004]: [4-1] LOG:  connection received:
host=ator.cs.wisc.edu port=35608
Jan 24 16:16:59 mitchell postgres[15004]: [5-1] LOG:  connection authorized:
user=koczan database=postgres


It is important to note that connecting via native Kerberos 5 does not cause
this problem, nor does going via md5 or trust authentication.

I compiled against Kerberos 5 - 1.6.2

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes:
> Trying to connect via GSSAPI (with Kerberos 5 as the underlying mechanism)
> after the server has been under a very slight load results in an unusable
> but still running server.

Can you get us a stack trace from that crash?

            regards, tom lane

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes:
> Looking into it more, it looks like the server is restarting every
> time it encounters this. I was wrong thinking that it stayed crashed,
> I guess I was just looking at a stale connection.

I think you misunderstand what is happening.  The parent postmaster
process is not restarting, because it did not crash.  The crash is
happening in a child process that is forked off by the postmaster to
service a particular connection.

> Since the server restarts and any connections either go away forever
> or just reset, post-crash stack traces won't do much good.

No, we want a stack trace from the crash.  You could attach gdb to the
child process after forking and before you provoke the crash; or arrange
for a core dump file to be produced and gdb that.

            regards, tom lane

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
"Peter Koczan"
Date:
On Jan 25, 2008 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Can you get us a stack trace from that crash?

Here's the trace of the server process post-crash.

[root@mitchell ~]# gdb -p 24078
(gdb) bt
#0  0x006ac402 in __kernel_vsyscall ()
#1  0x0060801d in ___newselect_nocancel () from /lib/libc.so.6
#2  0x0820db22 in ServerLoop ()
#3  0x0820d631 in PostmasterMain ()
#4  0x081b2ee7 in main ()

Looking into it more, it looks like the server is restarting every
time it encounters this. I was wrong thinking that it stayed crashed,
I guess I was just looking at a stale connection.

However, I did find a way to trigger this behavior, and it still only
happens with GSSAPI auth.

Basically, run a statement with a syntax error while logged in under
GSSAPI (I think anything that generates an ERROR level message will
work).

Here's the transcript of a GSSAPI connection:
postgres=> select current_user;
 current_user
--------------
 nobody
(1 row)

postgres=> select * from pg_database 1;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> select * from pg_database 1;
You are currently not connected to a database.
!>

And of an md5-based connection (for user nobody, note that native krb5
connections exhibit this same behavior):
postgres=> select * from pg_database 1;
ERROR:  syntax error at or near "1"
LINE 1: select * from pg_database 1;
                                  ^
postgres=> select current_user;
 current_user
--------------
 nobody
(1 row)

postgres=> select * from pg_database 1;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=>

And the syslog entries show this:
Jan 26 19:37:38 mitchell postgres[24586]: [12-1] LOG:  connection
received: host=ator.cs.wisc.edu port=59442
Jan 26 19:37:38 mitchell postgres[24586]: [13-1] LOG:  connection
authorized: user=nobody database=postgres
Jan 26 19:37:47 mitchell postgres[24587]: [12-1] LOG:  connection
received: host=ator.cs.wisc.edu port=59443
Jan 26 19:37:47 mitchell postgres[24587]: [13-1] LOG:  connection
authorized: user=koczan database=postgres
Jan 26 19:38:04 mitchell postgres[24586]: [14-1] ERROR:  syntax error
at or near "1" at character 27
Jan 26 19:38:04 mitchell postgres[24586]: [14-2] STATEMENT:  select *
from pg_database 1;
Jan 26 19:38:11 mitchell postgres[24078]: [12-1] LOG:  server process
(PID 24587) was terminated by signal 11: Segmentation fault
Jan 26 19:38:11 mitchell postgres[24078]: [13-1] LOG:  terminating any
other active server processes
Jan 26 19:38:11 mitchell postgres[24586]: [15-1] WARNING:  terminating
connection because of crash of another server process
Jan 26 19:38:11 mitchell postgres[24586]: [15-2] DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server
Jan 26 19:38:11 mitchell postgres[24586]: [15-3]  process exited
abnormally and possibly corrupted shared memory.
Jan 26 19:38:11 mitchell postgres[24586]: [15-4] HINT:  In a moment
you should be able to reconnect to the database and repeat your
command.
Jan 26 19:38:11 mitchell postgres[24078]: [14-1] LOG:  all server
processes terminated; reinitializing
Jan 26 19:38:11 mitchell postgres[24607]: [15-1] LOG:  database system
was interrupted; last known up at 2008-01-26 19:25:52 CST
Jan 26 19:38:11 mitchell postgres[24607]: [16-1] LOG:  database system
was not properly shut down; automatic recovery in progress
Jan 26 19:38:11 mitchell postgres[24608]: [15-1] LOG:  connection
received: host=ator.cs.wisc.edu port=59446
Jan 26 19:38:11 mitchell postgres[24607]: [17-1] LOG:  record with
zero length at 0/87A5DC
Jan 26 19:38:11 mitchell postgres[24607]: [18-1] LOG:  redo is not required
Jan 26 19:38:11 mitchell postgres[24607]: [19-1] LOG:  checkpoint
starting: shutdown immediate
Jan 26 19:38:11 mitchell postgres[24607]: [20-1] LOG:  checkpoint
complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0
removed, 0 recycled; write=0.002 s,
Jan 26 19:38:11 mitchell postgres[24607]: [20-2]  sync=0.000 s, total=0.009 s
Jan 26 19:38:11 mitchell postgres[24611]: [15-1] LOG:  autovacuum
launcher started
Jan 26 19:38:11 mitchell postgres[24078]: [15-1] LOG:  database system
is ready to accept connections
Jan 26 19:38:11 mitchell postgres[24608]: [16-1] FATAL:  the database
system is in recovery mode
Jan 26 19:38:11 mitchell postgres[24613]: [16-1] LOG:  connection
received: host=ator.cs.wisc.edu port=59447
Jan 26 19:38:11 mitchell postgres[24613]: [17-1] FATAL:  no
pg_hba.conf entry for host "128.105.162.36", user "koczan", database
"postgres", SSL off
Jan 26 19:38:22 mitchell postgres[24615]: [16-1] LOG:  connection
received: host=ator.cs.wisc.edu port=59448
Jan 26 19:38:22 mitchell postgres[24615]: [17-1] LOG:  connection
authorized: user=nobody database=postgres

Since the server restarts and any connections either go away forever
or just reset, post-crash stack traces won't do much good. If there's
a way to get a stack-trace at the point of the crash, please let me
know how and I'll get you that information.

Also, why the connection is trying to reconnect without SSL intrigues
and concerns me.

Peter

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes:
> Trying to connect via GSSAPI (with Kerberos 5 as the underlying mechanism)
> after the server has been under a very slight load results in an unusable
> but still running server.

I couldn't reproduce this at all on Fedora 8.  I speculate that you've
not told us enough about the configuration you're using.  Given that the
crash doesn't happen until after "connection authorized" is logged, it
seems likely that the problem is not purely GSSAPI's fault but is an
interaction with some other option that I happen not to be using.
Please show your configure options and all non-default postgresql.conf
settings.

> The client connection shows this:
> $ /s/postgresql-8.3-RC2/bin/psql -h mitchell -p 5434
> psql: server closed the connection unexpectedly

BTW, I couldn't get GSSAPI to work at all without a fully-specified -h
option; it kept trying to use the wrong Kerberos principal names.
I wonder whether that is related --- how did you persuade it to do the
above?  Special sauce in krb5.conf maybe?

            regards, tom lane

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
"Peter Koczan"
Date:
> I think you misunderstand what is happening.  The parent postmaster
> process is not restarting, because it did not crash.  The crash is
> happening in a child process that is forked off by the postmaster to
> service a particular connection.

I see what you're saying, that it's one particular connection issue
causing these problems. I thought it was a postmaster issue since the
database went into recovery mode for a bit and *all* connections
attempted to reset.

> No, we want a stack trace from the crash.  You could attach gdb to the
> child process after forking and before you provoke the crash; or arrange
> for a core dump file to be produced and gdb that.

I should be able to get that for you tomorrow, if we don't figure out
the issue sooner.

> I couldn't reproduce this at all on Fedora 8.  I speculate that you've
> not told us enough about the configuration you're using.  Given that the
> crash doesn't happen until after "connection authorized" is logged, it
> seems likely that the problem is not purely GSSAPI's fault but is an
> interaction with some other option that I happen not to be using.
> Please show your configure options and all non-default postgresql.conf
> settings.

postgresql.conf entries (uncommented ones):
listen_addresses = '*'
port = 5434
max_connections = 300
unix_socket_directory = '/tmp'
ssl = on
krb_server_keyfile = '/etc/v5srvtab.postgres'
work_mem = 10MB
shared_buffers = 250MB
max_fsm_pages = 250000
max_fsm_relations = 5000
default_statistics_target = 100
log_destination = 'syslog'
syslog_facility = 'LOCAL1'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_duration = off
log_hostname = on
log_line_prefix = ''
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'

And my configure/build options:
export LDFLAGS="-L/s/openssl-0.9.8e/lib -L/s/krb5-1.6.2/lib
-L/s/readline-5.2/lib -L/s/libxml2-2.6.30/lib
-Wl,-rpath,/s/openssl-0.9.8e/lib -Wl,-rpath,/s/krb5-1.6.2/lib
-Wl,-rpath,/s/readline-5.2/lib -Wl,-rpath,/s/libxml2-2.6.30/lib"
export CFLAGS="-I/s/openssl-0.9.8e/include -I/s/krb5-1.6.2/include
-I/s/readline-5.2/include -I/s/libxml2-2.6.30/include"
./configure --prefix=/s/postgresql-8.3-RC2/i386_rhel5 \
  --with-includes=/s/openssl-0.9.8e/include:/s/krb5-1.6.2/include:/s/readline-5.2/include:/s/libxml2-2.6.30/include
\
  --with-libraries=/s/openssl-0.9.8e/lib:/s/krb5-1.6.2/lib:/s/readline-5.2/lib:/s/libxml2-2.6.30/lib
\
  --with-gssapi \
  --with-krb5 \
  --with-openssl \
  --with-pam \
  --with-libxml \
  --with-perl \
  --with-python

> BTW, I couldn't get GSSAPI to work at all without a fully-specified -h
> option; it kept trying to use the wrong Kerberos principal names.
> I wonder whether that is related --- how did you persuade it to do the
> above?  Special sauce in krb5.conf maybe?

I'll check with our main kerberos admin to see if there's anything
weird in our krb5.conf. Is there anything in particular to look for?

However, I think the special sauce is in DNS. There's an unqualified
name in /etc/hosts that can resolve to the FQDN.

$ grep mitchell /etc/hosts
[IP address]  mitchell.cs.wisc.edu mitchell

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
"Peter Koczan"
Date:
> > No, we want a stack trace from the crash.  You could attach gdb to the
> > child process after forking and before you provoke the crash; or arrange
> > for a core dump file to be produced and gdb that.
>
> I should be able to get that for you tomorrow, if we don't figure out
> the issue sooner.

Here's a few stack traces for you.

Initial idle connection stack trace:

(gdb) bt
#0  0x006ac402 in __kernel_vsyscall ()
#1  0x005ffed3 in __read_nocancel () from /lib/libc.so.6
#2  0x00ad569e in sock_read () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#3  0x00000008 in ?? ()
#4  0x085f63a8 in ?? ()
#5  0x00000005 in ?? ()
#6  0x00b6b520 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#7  0x085c9a78 in ?? ()
#8  0x00000000 in ?? ()

After a bit of "step"-ing (15 to be exact) after sending the query to
provoke the crash, there's an I/O error reported, always at the same
address. Note that stepping goes through a few SSL functions, and
secure_read as well:

(gdb) step
Single stepping until exit from function PostgresMain,
which has no line number information.
Warning:
Cannot insert breakpoint -57.
Error accessing memory address 0x783e3459: Input/output error.

(gdb) bt
#0  0x00569990 in siglongjmp () from /lib/libc.so.6
#1  0x0830dec0 in pg_re_throw ()
#2  0x0830caa1 in errfinish ()
#3  0x0812d40d in ExpandAllTables ()
#4  0x0812d0b4 in ExpandColumnRefStar ()
#5  0x0812be20 in transformTargetList ()
#6  0x080fe481 in transformSelectStmt ()
#7  0x080fd7fc in transformStmt ()
#8  0x080fd60d in parse_analyze ()
#9  0x08240fd8 in pg_analyze_and_rewrite ()
#10 0x082413ee in exec_simple_query ()
#11 0x08245284 in PostgresMain ()
#12 0x08210b9e in BackendRun ()
#13 0x08210174 in BackendStartup ()
#14 0x0820dc53 in ServerLoop ()
#15 0x0820d631 in PostmasterMain ()
#16 0x081b2ee7 in main ()

3 more "step"s gets me here, to the segfault and connection termination:
(gdb) step
Single stepping until exit from function siglongjmp,
which has no line number information.

Program received signal SIGSEGV, Segmentation fault.
0x0056bfde in getenv () from /lib/libc.so.6
(gdb) bt
#0  0x0056bfde in getenv () from /lib/libc.so.6
#1  0x005c1490 in tzset_internal () from /lib/libc.so.6
#2  0x005c1fad in tzset () from /lib/libc.so.6
#3  0x005c6726 in strftime_l () from /lib/libc.so.6
#4  0x0060b67f in __vsyslog_chk () from /lib/libc.so.6
#5  0x0060bbaa in syslog () from /lib/libc.so.6
#6  0x0830e3f9 in write_syslog ()
#7  0x0830f7a8 in send_message_to_server_log ()
#8  0x0830daf4 in EmitErrorReport ()
#9  0x082450bd in PostgresMain ()
#10 0x08210b9e in BackendRun ()
#11 0x08210174 in BackendStartup ()
#12 0x0820dc53 in ServerLoop ()
#13 0x0820d631 in PostmasterMain ()
#14 0x081b2ee7 in main ()

One more step, the connection is gone:
(gdb) step
Single stepping until exit from function getenv,
which has no line number information.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
Tom Lane
Date:
"Peter Koczan" <pjkoczan@gmail.com> writes:
> Program received signal SIGSEGV, Segmentation fault.
> 0x0056bfde in getenv () from /lib/libc.so.6
> (gdb) bt
> #0  0x0056bfde in getenv () from /lib/libc.so.6
> #1  0x005c1490 in tzset_internal () from /lib/libc.so.6
> #2  0x005c1fad in tzset () from /lib/libc.so.6
> #3  0x005c6726 in strftime_l () from /lib/libc.so.6
> #4  0x0060b67f in __vsyslog_chk () from /lib/libc.so.6
> #5  0x0060bbaa in syslog () from /lib/libc.so.6
> #6  0x0830e3f9 in write_syslog ()
> #7  0x0830f7a8 in send_message_to_server_log ()
> #8  0x0830daf4 in EmitErrorReport ()
> #9  0x082450bd in PostgresMain ()
> #10 0x08210b9e in BackendRun ()
> #11 0x08210174 in BackendStartup ()
> #12 0x0820dc53 in ServerLoop ()
> #13 0x0820d631 in PostmasterMain ()
> #14 0x081b2ee7 in main ()

Hmm, crash inside getenv??  That's weird, that suggests something has
clobbered the environment-strings data structure.

[ pokes around in code... ]  And behold, here's a smoking gun:
pg_GSS_recvauth() is doing a putenv() with a string that it got from
palloc().  The active context at this point is PostmasterContext,
which *is* going to go away soon, leaving a gaping hole in the environ
list.

I still couldn't reproduce a crash with that knowledge, but I bet that
if you apply the attached patch, things will get better.

            regards, tom lane

Index: src/backend/libpq/auth.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/libpq/auth.c,v
retrieving revision 1.162
diff -c -r1.162 auth.c
*** auth.c    1 Jan 2008 19:45:49 -0000    1.162
--- auth.c    30 Jan 2008 04:07:42 -0000
***************
*** 384,390 ****
                  min_stat,
                  lmin_s,
                  gflags;
-     char       *kt_path;
      int            mtype;
      int            ret;
      StringInfoData buf;
--- 384,389 ----
***************
*** 398,408 ****
           * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0); except setenv()
           * not always available.
           */
!         if (!getenv("KRB5_KTNAME"))
          {
!             kt_path = palloc(MAXPGPATH + 13);
!             snprintf(kt_path, MAXPGPATH + 13,
!                      "KRB5_KTNAME=%s", pg_krb_server_keyfile);
              putenv(kt_path);
          }
      }
--- 397,415 ----
           * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0); except setenv()
           * not always available.
           */
!         if (getenv("KRB5_KTNAME") == NULL)
          {
!             size_t    kt_len = strlen(pg_krb_server_keyfile) + 14;
!             char   *kt_path = malloc(kt_len);
!
!             if (!kt_path)
!             {
!                 ereport(LOG,
!                         (errcode(ERRCODE_OUT_OF_MEMORY),
!                          errmsg("out of memory")));
!                 return STATUS_ERROR;
!             }
!             snprintf(kt_path, kt_len, "KRB5_KTNAME=%s", pg_krb_server_keyfile);
              putenv(kt_path);
          }
      }

Re: BUG #3902: Segmentation faults using GSSAPI authentication

From
"Peter Koczan"
Date:
> Hmm, crash inside getenv??  That's weird, that suggests something has
> clobbered the environment-strings data structure.
>
> [ pokes around in code... ]  And behold, here's a smoking gun:
> pg_GSS_recvauth() is doing a putenv() with a string that it got from
> palloc().  The active context at this point is PostmasterContext,
> which *is* going to go away soon, leaving a gaping hole in the environ
> list.
>
> I still couldn't reproduce a crash with that knowledge, but I bet that
> if you apply the attached patch, things will get better.

The patch worked like a charm. Thanks, Tom.

Peter