Thread: BUG #3902: Segmentation faults using GSSAPI authentication
The following bug has been logged online:

Bug reference:      3902
Logged by:          Peter Koczan
Email address:      pjkoczan@gmail.com
PostgreSQL version: 8.3RC2
Operating system:   Red Hat Enterprise Linux 5
Description:        Segmentation faults using GSSAPI authentication

Details:

Trying to connect via GSSAPI (with Kerberos 5 as the underlying mechanism)
after the server has been under a very slight load results in an unusable
but still running server. I was able to produce this problem by connecting
to the database, running a few commands, disconnecting, and trying to
reconnect. I was also able to produce it by trying to connect via GSSAPI
to a server while it was restoring a database (using local trust
authentication).

The syslog output shows many of these errors from one such server process:

Jan 24 15:46:17 mitchell postgres[14663]: [1-1] LOG:  could not load root certificate file "root.crt": No such file or directory
Jan 24 15:46:17 mitchell postgres[14663]: [1-2] DETAIL:  Will not verify client certificates.
Jan 24 15:46:17 mitchell postgres[14663]: [2-1] LOG:  database system is ready to accept connections
Jan 24 16:16:18 mitchell postgres[14663]: [3-1] LOG:  received SIGHUP, reloading configuration files
Jan 24 16:18:31 mitchell postgres[14663]: [4-1] LOG:  server process (PID 15004) was terminated by signal 11: Segmentation fault
Jan 24 16:18:31 mitchell postgres[14663]: [5-1] LOG:  terminating any other active server processes
Jan 24 16:18:31 mitchell postgres[14663]: [6-1] LOG:  all server processes terminated; reinitializing
Jan 24 16:18:31 mitchell postgres[14663]: [7-1] LOG:  database system is ready to accept connections
Jan 24 16:48:49 mitchell postgres[14663]: [8-1] LOG:  received smart shutdown request

The client connection shows this:

$ /s/postgresql-8.3-RC2/bin/psql -h mitchell -p 5434
psql: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Syslog for a connection shows this:

Jan 24 16:16:59 mitchell postgres[15004]: [4-1] LOG:  connection received: host=ator.cs.wisc.edu port=35608
Jan 24 16:16:59 mitchell postgres[15004]: [5-1] LOG:  connection authorized: user=koczan database=postgres

It is important to note that connecting via native Kerberos 5 does not
cause this problem, nor does connecting via md5 or trust authentication.
I compiled against Kerberos 5 1.6.2.
"Peter Koczan" <pjkoczan@gmail.com> writes: > Trying to connect via GSSAPI (with Kerberos 5 as the underlying mechanism) > after the server has been under a very slight load results in an unusable > but still running server. Can you get us a stack trace from that crash? regards, tom lane
"Peter Koczan" <pjkoczan@gmail.com> writes: > Looking into it more, it looks like the server is restarting every > time it encounters this. I was wrong thinking that it stayed crashed, > I guess I was just looking at a stale connection. I think you misunderstand what is happening. The parent postmaster process is not restarting, because it did not crash. The crash is happening in a child process that is forked off by the postmaster to service a particular connection. > Since the server restarts and any connections either go away forever > or just reset, post-crash stack traces won't do much good. No, we want a stack trace from the crash. You could attach gdb to the child process after forking and before you provoke the crash; or arrange for a core dump file to be produced and gdb that. regards, tom lane
On Jan 25, 2008 4:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Can you get us a stack trace from that crash?

Here's the trace of the server process post-crash.

[root@mitchell ~]# gdb -p 24078
(gdb) bt
#0  0x006ac402 in __kernel_vsyscall ()
#1  0x0060801d in ___newselect_nocancel () from /lib/libc.so.6
#2  0x0820db22 in ServerLoop ()
#3  0x0820d631 in PostmasterMain ()
#4  0x081b2ee7 in main ()

Looking into it more, it looks like the server is restarting every
time it encounters this. I was wrong thinking that it stayed crashed;
I guess I was just looking at a stale connection.

However, I did find a way to trigger this behavior, and it still only
happens with GSSAPI auth. Basically, run a statement with a syntax
error while logged in under GSSAPI (I think anything that generates an
ERROR-level message will work).

Here's the transcript of a GSSAPI connection:

postgres=> select current_user;
 current_user
--------------
 nobody
(1 row)

postgres=> select * from pg_database 1;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> select * from pg_database 1;
You are currently not connected to a database.
!>

And here's the transcript of an md5-based connection (for user nobody;
note that native krb5 connections exhibit this same behavior):

postgres=> select * from pg_database 1;
ERROR:  syntax error at or near "1"
LINE 1: select * from pg_database 1;
                                   ^
postgres=> select current_user;
 current_user
--------------
 nobody
(1 row)

postgres=> select * from pg_database 1;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
postgres=>

And the syslog entries show this:

Jan 26 19:37:38 mitchell postgres[24586]: [12-1] LOG:  connection received: host=ator.cs.wisc.edu port=59442
Jan 26 19:37:38 mitchell postgres[24586]: [13-1] LOG:  connection authorized: user=nobody database=postgres
Jan 26 19:37:47 mitchell postgres[24587]: [12-1] LOG:  connection received: host=ator.cs.wisc.edu port=59443
Jan 26 19:37:47 mitchell postgres[24587]: [13-1] LOG:  connection authorized: user=koczan database=postgres
Jan 26 19:38:04 mitchell postgres[24586]: [14-1] ERROR:  syntax error at or near "1" at character 27
Jan 26 19:38:04 mitchell postgres[24586]: [14-2] STATEMENT:  select * from pg_database 1;
Jan 26 19:38:11 mitchell postgres[24078]: [12-1] LOG:  server process (PID 24587) was terminated by signal 11: Segmentation fault
Jan 26 19:38:11 mitchell postgres[24078]: [13-1] LOG:  terminating any other active server processes
Jan 26 19:38:11 mitchell postgres[24586]: [15-1] WARNING:  terminating connection because of crash of another server process
Jan 26 19:38:11 mitchell postgres[24586]: [15-2] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server
Jan 26 19:38:11 mitchell postgres[24586]: [15-3] process exited abnormally and possibly corrupted shared memory.
Jan 26 19:38:11 mitchell postgres[24586]: [15-4] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
Jan 26 19:38:11 mitchell postgres[24078]: [14-1] LOG:  all server processes terminated; reinitializing
Jan 26 19:38:11 mitchell postgres[24607]: [15-1] LOG:  database system was interrupted; last known up at 2008-01-26 19:25:52 CST
Jan 26 19:38:11 mitchell postgres[24607]: [16-1] LOG:  database system was not properly shut down; automatic recovery in progress
Jan 26 19:38:11 mitchell postgres[24608]: [15-1] LOG:  connection received: host=ator.cs.wisc.edu port=59446
Jan 26 19:38:11 mitchell postgres[24607]: [17-1] LOG:  record with zero length at 0/87A5DC
Jan 26 19:38:11 mitchell postgres[24607]: [18-1] LOG:  redo is not required
Jan 26 19:38:11 mitchell postgres[24607]: [19-1] LOG:  checkpoint starting: shutdown immediate
Jan 26 19:38:11 mitchell postgres[24607]: [20-1] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.002 s,
Jan 26 19:38:11 mitchell postgres[24607]: [20-2] sync=0.000 s, total=0.009 s
Jan 26 19:38:11 mitchell postgres[24611]: [15-1] LOG:  autovacuum launcher started
Jan 26 19:38:11 mitchell postgres[24078]: [15-1] LOG:  database system is ready to accept connections
Jan 26 19:38:11 mitchell postgres[24608]: [16-1] FATAL:  the database system is in recovery mode
Jan 26 19:38:11 mitchell postgres[24613]: [16-1] LOG:  connection received: host=ator.cs.wisc.edu port=59447
Jan 26 19:38:11 mitchell postgres[24613]: [17-1] FATAL:  no pg_hba.conf entry for host "128.105.162.36", user "koczan", database "postgres", SSL off
Jan 26 19:38:22 mitchell postgres[24615]: [16-1] LOG:  connection received: host=ator.cs.wisc.edu port=59448
Jan 26 19:38:22 mitchell postgres[24615]: [17-1] LOG:  connection authorized: user=nobody database=postgres

Since the server restarts and any connections either go away forever
or just reset, post-crash stack traces won't do much good. If there's
a way to get a stack trace at the point of the crash, please let me
know how and I'll get you that information.

Also, why the connection tries to reconnect without SSL intrigues and
concerns me.

Peter
"Peter Koczan" <pjkoczan@gmail.com> writes: > Trying to connect via GSSAPI (with Kerberos 5 as the underlying mechanism) > after the server has been under a very slight load results in an unusable > but still running server. I couldn't reproduce this at all on Fedora 8. I speculate that you've not told us enough about the configuration you're using. Given that the crash doesn't happen until after "connection authorized" is logged, it seems likely that the problem is not purely GSSAPI's fault but is an interaction with some other option that I happen not to be using. Please show your configure options and all non-default postgresql.conf settings. > The client connection shows this: > $ /s/postgresql-8.3-RC2/bin/psql -h mitchell -p 5434 > psql: server closed the connection unexpectedly BTW, I couldn't get GSSAPI to work at all without a fully-specified -h option; it kept trying to use the wrong Kerberos principal names. I wonder whether that is related --- how did you persuade it to do the above? Special sauce in krb5.conf maybe? regards, tom lane
> I think you misunderstand what is happening. The parent postmaster
> process is not restarting, because it did not crash. The crash is
> happening in a child process that is forked off by the postmaster to
> service a particular connection.

I see what you're saying, that it's a crash in one particular
connection causing these problems. I thought it was a postmaster issue
since the database went into recovery mode for a bit and *all*
connections attempted to reset.

> No, we want a stack trace from the crash. You could attach gdb to the
> child process after forking and before you provoke the crash; or arrange
> for a core dump file to be produced and gdb that.

I should be able to get that for you tomorrow, if we don't figure out
the issue sooner.

> I couldn't reproduce this at all on Fedora 8. I speculate that you've
> not told us enough about the configuration you're using. Given that the
> crash doesn't happen until after "connection authorized" is logged, it
> seems likely that the problem is not purely GSSAPI's fault but is an
> interaction with some other option that I happen not to be using.
> Please show your configure options and all non-default postgresql.conf
> settings.

postgresql.conf entries (uncommented ones):

listen_addresses = '*'
port = 5434
max_connections = 300
unix_socket_directory = '/tmp'
ssl = on
krb_server_keyfile = '/etc/v5srvtab.postgres'
work_mem = 10MB
shared_buffers = 250MB
max_fsm_pages = 250000
max_fsm_relations = 5000
default_statistics_target = 100
log_destination = 'syslog'
syslog_facility = 'LOCAL1'
log_checkpoints = on
log_connections = on
log_disconnections = on
log_duration = off
log_hostname = on
log_line_prefix = ''
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'

And my configure/build options:

export LDFLAGS="-L/s/openssl-0.9.8e/lib -L/s/krb5-1.6.2/lib -L/s/readline-5.2/lib -L/s/libxml2-2.6.30/lib -Wl,-rpath,/s/openssl-0.9.8e/lib -Wl,-rpath,/s/krb5-1.6.2/lib -Wl,-rpath,/s/readline-5.2/lib -Wl,-rpath,/s/libxml2-2.6.30/lib"
export CFLAGS="-I/s/openssl-0.9.8e/include -I/s/krb5-1.6.2/include -I/s/readline-5.2/include -I/s/libxml2-2.6.30/include"
./configure --prefix=/s/postgresql-8.3-RC2/i386_rhel5 \
  --with-includes=/s/openssl-0.9.8e/include:/s/krb5-1.6.2/include:/s/readline-5.2/include:/s/libxml2-2.6.30/include \
  --with-libraries=/s/openssl-0.9.8e/lib:/s/krb5-1.6.2/lib:/s/readline-5.2/lib:/s/libxml2-2.6.30/lib \
  --with-gssapi \
  --with-krb5 \
  --with-openssl \
  --with-pam \
  --with-libxml \
  --with-perl \
  --with-python

> BTW, I couldn't get GSSAPI to work at all without a fully-specified -h
> option; it kept trying to use the wrong Kerberos principal names.
> I wonder whether that is related --- how did you persuade it to do the
> above? Special sauce in krb5.conf maybe?

I'll check with our main Kerberos admin to see if there's anything
weird in our krb5.conf. Is there anything in particular to look for?
However, I think the special sauce is in DNS. There's an unqualified
name in /etc/hosts that resolves to the FQDN.

$ grep mitchell /etc/hosts
[IP address]    mitchell.cs.wisc.edu    mitchell
> > No, we want a stack trace from the crash. You could attach gdb to the
> > child process after forking and before you provoke the crash; or arrange
> > for a core dump file to be produced and gdb that.
>
> I should be able to get that for you tomorrow, if we don't figure out
> the issue sooner.

Here are a few stack traces for you.

Initial idle connection stack trace:

(gdb) bt
#0  0x006ac402 in __kernel_vsyscall ()
#1  0x005ffed3 in __read_nocancel () from /lib/libc.so.6
#2  0x00ad569e in sock_read () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#3  0x00000008 in ?? ()
#4  0x085f63a8 in ?? ()
#5  0x00000005 in ?? ()
#6  0x00b6b520 in ?? () from /s/openssl-0.9.8e/lib/libcrypto.so.0.9.8e
#7  0x085c9a78 in ?? ()
#8  0x00000000 in ?? ()

After a bit of "step"-ing (15 steps, to be exact) after sending the
query to provoke the crash, there's an I/O error reported, always at
the same address. Note that stepping goes through a few SSL functions,
and secure_read as well:

(gdb) step
Single stepping until exit from function PostgresMain,
which has no line number information.
Warning:
Cannot insert breakpoint -57.
Error accessing memory address 0x783e3459: Input/output error.

(gdb) bt
#0  0x00569990 in siglongjmp () from /lib/libc.so.6
#1  0x0830dec0 in pg_re_throw ()
#2  0x0830caa1 in errfinish ()
#3  0x0812d40d in ExpandAllTables ()
#4  0x0812d0b4 in ExpandColumnRefStar ()
#5  0x0812be20 in transformTargetList ()
#6  0x080fe481 in transformSelectStmt ()
#7  0x080fd7fc in transformStmt ()
#8  0x080fd60d in parse_analyze ()
#9  0x08240fd8 in pg_analyze_and_rewrite ()
#10 0x082413ee in exec_simple_query ()
#11 0x08245284 in PostgresMain ()
#12 0x08210b9e in BackendRun ()
#13 0x08210174 in BackendStartup ()
#14 0x0820dc53 in ServerLoop ()
#15 0x0820d631 in PostmasterMain ()
#16 0x081b2ee7 in main ()

Three more "step"s get me here, to the segfault and connection
termination:

(gdb) step
Single stepping until exit from function siglongjmp,
which has no line number information.

Program received signal SIGSEGV, Segmentation fault.
0x0056bfde in getenv () from /lib/libc.so.6
(gdb) bt
#0  0x0056bfde in getenv () from /lib/libc.so.6
#1  0x005c1490 in tzset_internal () from /lib/libc.so.6
#2  0x005c1fad in tzset () from /lib/libc.so.6
#3  0x005c6726 in strftime_l () from /lib/libc.so.6
#4  0x0060b67f in __vsyslog_chk () from /lib/libc.so.6
#5  0x0060bbaa in syslog () from /lib/libc.so.6
#6  0x0830e3f9 in write_syslog ()
#7  0x0830f7a8 in send_message_to_server_log ()
#8  0x0830daf4 in EmitErrorReport ()
#9  0x082450bd in PostgresMain ()
#10 0x08210b9e in BackendRun ()
#11 0x08210174 in BackendStartup ()
#12 0x0820dc53 in ServerLoop ()
#13 0x0820d631 in PostmasterMain ()
#14 0x081b2ee7 in main ()

One more step, and the connection is gone:

(gdb) step
Single stepping until exit from function getenv,
which has no line number information.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
"Peter Koczan" <pjkoczan@gmail.com> writes: > Program received signal SIGSEGV, Segmentation fault. > 0x0056bfde in getenv () from /lib/libc.so.6 > (gdb) bt > #0 0x0056bfde in getenv () from /lib/libc.so.6 > #1 0x005c1490 in tzset_internal () from /lib/libc.so.6 > #2 0x005c1fad in tzset () from /lib/libc.so.6 > #3 0x005c6726 in strftime_l () from /lib/libc.so.6 > #4 0x0060b67f in __vsyslog_chk () from /lib/libc.so.6 > #5 0x0060bbaa in syslog () from /lib/libc.so.6 > #6 0x0830e3f9 in write_syslog () > #7 0x0830f7a8 in send_message_to_server_log () > #8 0x0830daf4 in EmitErrorReport () > #9 0x082450bd in PostgresMain () > #10 0x08210b9e in BackendRun () > #11 0x08210174 in BackendStartup () > #12 0x0820dc53 in ServerLoop () > #13 0x0820d631 in PostmasterMain () > #14 0x081b2ee7 in main () Hmm, crash inside getenv?? That's weird, that suggests something has clobbered the environment-strings data structure. [ pokes around in code... ] And behold, here's a smoking gun: pg_GSS_recvauth() is doing a putenv() with a string that it got from palloc(). The active context at this point is PostmasterContext, which *is* going to go away soon, leaving a gaping hole in the environ list. I still couldn't reproduce a crash with that knowledge, but I bet that if you apply the attached patch, things will get better. regards, tom lane Index: src/backend/libpq/auth.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/libpq/auth.c,v retrieving revision 1.162 diff -c -r1.162 auth.c *** auth.c 1 Jan 2008 19:45:49 -0000 1.162 --- auth.c 30 Jan 2008 04:07:42 -0000 *************** *** 384,390 **** min_stat, lmin_s, gflags; - char *kt_path; int mtype; int ret; StringInfoData buf; --- 384,389 ---- *************** *** 398,408 **** * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0); except setenv() * not always available. */ ! if (!getenv("KRB5_KTNAME")) { ! kt_path = palloc(MAXPGPATH + 13); ! snprintf(kt_path, MAXPGPATH + 13, ! "KRB5_KTNAME=%s", pg_krb_server_keyfile); putenv(kt_path); } } --- 397,415 ---- * setenv("KRB5_KTNAME", pg_krb_server_keyfile, 0); except setenv() * not always available. */ ! if (getenv("KRB5_KTNAME") == NULL) { ! size_t kt_len = strlen(pg_krb_server_keyfile) + 14; ! char *kt_path = malloc(kt_len); ! ! if (!kt_path) ! { ! ereport(LOG, ! (errcode(ERRCODE_OUT_OF_MEMORY), ! errmsg("out of memory"))); ! return STATUS_ERROR; ! } ! snprintf(kt_path, kt_len, "KRB5_KTNAME=%s", pg_krb_server_keyfile); putenv(kt_path); } }
> Hmm, a crash inside getenv?? That's weird; it suggests something has
> clobbered the environment-strings data structure.
>
> [ pokes around in code... ] And behold, here's a smoking gun:
> pg_GSS_recvauth() is doing a putenv() with a string that it got from
> palloc(). The active context at this point is PostmasterContext,
> which *is* going to go away soon, leaving a gaping hole in the environ
> list.
>
> I still couldn't reproduce a crash with that knowledge, but I bet that
> if you apply the attached patch, things will get better.

The patch worked like a charm. Thanks, Tom.

Peter