Thread: Psql crashes with Segmentation fault on copy from
I am setting up a new machine and preparing some standard benchmark tests. While trying to load some data using "copy from" psql is crashing. OS Red Hat Enterprise Linux ES release 4 (Nahant Update 6) Postgresql 8.2.7 installed from RPMs I ran strace on psql and got: read(5, "DAIRY QUEEN Q68\t5814\t1215"..., 4096) = 4096 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 ...... 36 other lines like the ones above/below. rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 read(5, "TORE 2653\t5311\t93\t5080284\n2qLMfi"..., 4096) = 3796 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 ...... 36 other lines like the ones above/below. rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0 read(5, "", 4096) = 0 gettimeofday({1211915058, 706827}, NULL) = 0 --- SIGSEGV (Segmentation fault) @ 0 (0) --- +++ killed by SIGSEGV +++ On the server I increased the verbosity to debug levels: DEBUG: 00000: checkpoint starting LOCATION: CreateCheckPoint, xlog.c:5584 DEBUG: 00000: checkpoint complete; 0 transaction log file(s) added, 0 removed, 0 recycled LOCATION: CreateCheckPoint, xlog.c:5712 LOG: 08P01: unexpected EOF on client connection LOCATION: SocketBackend, postgres.c:307 DEBUG: 00000: proc_exit(0) LOCATION: proc_exit, ipc.c:94 DEBUG: 00000: shmem_exit(0) LOCATION: shmem_exit, ipc.c:125 LOG: 00000: disconnection: session time: 0:28:57.253 user=benchmark database=benchmark host=[local] LOCATION: log_disconnections, postgres.c:3840 DEBUG: 00000: exit(0) LOCATION: proc_exit, ipc.c:112 DEBUG: 00000: server process (PID 3112) exited with exit code 0 LOCATION: LogChildExit, postmaster.c:2425 The same data file loads into another machine with Postgresql 8.2.4 and the same version of RedHat.. Checked the disk with the badblocks command. No errors found.
"Francisco Reyes" <lists@stringsutils.com> writes: > I am setting up a new machine and preparing some standard benchmark tests. > While trying to load some data using "copy from" psql is crashing. Can you get us a stack trace from the crash? (You'd likely need to install the postgresql-debuginfo RPM to get a useful trace.) Alternatively, can you provide a self-contained test case? regards, tom lane
On 6:28 pm 05/27/08 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Can you get us a stack trace from the crash? Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 182894175648 (LWP 4487)] 0x0000003cc31723e6 in memcpy () from /lib64/tls/libc.so.6 (gdb) bt #0 0x0000003cc31723e6 in memcpy () from /lib64/tls/libc.so.6 #1 0x000000364bf0e0ae in PQunescapeBytea () from /usr/lib64/libpq.so.5 #2 0x000000364bf0e230 in PQunescapeBytea () from /usr/lib64/libpq.so.5 #3 0x000000364bf0c09e in PQsendQuery () from /usr/lib64/libpq.so.5 #4 0x000000364bf0c788 in PQexec () from /usr/lib64/libpq.so.5 #5 0x0000000000406e95 in ?? () #6 0x0000000000409dfa in ?? () #7 0x000000000040408d in ?? () #8 0x00000000004057cd in ?? () #9 0x0000000000406286 in ?? () #10 0x0000000000409f66 in ?? () #11 0x000000000040408d in ?? () #12 0x00000000004057cd in ?? () #13 0x0000000000406286 in ?? () #14 0x0000000000409f66 in ?? () #15 0x000000000040c687 in ?? () #16 0x0000003cc311c3fb in __libc_start_main () from /lib64/tls/libc.so.6 #17 0x0000000000403d2a in ?? () #18 0x0000007fbffff558 in ?? () #19 0x000000000000001c in ?? () #20 0x0000000000000001 in ?? () #21 0x0000007fbffff795 in ?? () #22 0x0000000000000000 in ?? () Is that what you need? I installed compat-postgresql-libs-debuginfo-3-2PGDG.rhel4.x86_64.rpm postgresql-debuginfo-8.2.7-1PGDG.rhel4.x86_64.rpm What is compat-postgresql-libs-debuginfo? Neither of those two RPMs are described in the RPM install PDF by Devrim and Lamar.
"Francisco Reyes" <lists@stringsutils.com> writes: > (gdb) bt > #0 0x0000003cc31723e6 in memcpy () from /lib64/tls/libc.so.6 > #1 0x000000364bf0e0ae in PQunescapeBytea () from /usr/lib64/libpq.so.5 > #2 0x000000364bf0e230 in PQunescapeBytea () from /usr/lib64/libpq.so.5 > #3 0x000000364bf0c09e in PQsendQuery () from /usr/lib64/libpq.so.5 > #4 0x000000364bf0c788 in PQexec () from /usr/lib64/libpq.so.5 > #5 0x0000000000406e95 in ?? () > #6 0x0000000000409dfa in ?? () > #7 0x000000000040408d in ?? () > Is that what you need? Well, it would be if it were right; but PQsendQuery doesn't call PQunescapeBytea, so there's something wrong with the debug info. (The fact that we're not seeing any parameter values is another tip that it's not right. Apparently gdb doesn't think the debuginfo matches at all.) > I installed > compat-postgresql-libs-debuginfo-3-2PGDG.rhel4.x86_64.rpm > postgresql-debuginfo-8.2.7-1PGDG.rhel4.x86_64.rpm Do those *exactly* match the versions of the Postgres RPMs you're using? regards, tom lane
On 11:09 am 05/28/08 Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I installed > > compat-postgresql-libs-debuginfo-3-2PGDG.rhel4.x86_64.rpm > > postgresql-debuginfo-8.2.7-1PGDG.rhel4.x86_64.rpm > > Do those *exactly* match the versions of the Postgres RPMs you're > using? I got them from the same directory as the rest of the 8.2.7 RPMs I downloaded. I am going to try uninstalling the RPMs and using the Source RPMS and report back. There is no production data yet since I am working on testing the machine and creating some benchmarks so I can redo the entire setup as needed.
On 11:09 am 05/28/08 Tom Lane <tgl@sss.pgh.pa.us> wrote: Re-send. Didn't do reply-all before. > Well, it would be if it were right; but PQsendQuery doesn't call > PQunescapeBytea, so there's something wrong with the debug info. .. > > compat-postgresql-libs-debuginfo-3-2PGDG.rhel4.x86_64.rpm That was the wrong lib. Installed postgresql-debuginfo-8.2.7-1PGDG.rhel4.x86_64.rpm and got Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 182895355296 (LWP 5008)] 0x0000003cc31723e6 in memcpy () from /lib64/tls/libc.so.6 (gdb) bt #0 0x0000003cc31723e6 in memcpy () from /lib64/tls/libc.so.6 #1 0x0000002a955820ae in pqPutMsgBytes (buf=0x2a9860a010, len=2147483647, conn=0x560a80) at fe-misc.c:475 #2 0x0000002a95582230 in pqPuts ( s=0x2a9860a010 "yjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t10.47\t08620667742MDF1\t\tMCDONALD'S F7742\t5814\t1003\t1187954\nyjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t16.99\t00614111869AWG1\t\tPIGGLY WIGGLY#52\t5411"..., conn=0x560a80) at fe-misc.c:151 #3 0x0000002a9558009e in PQsendQuery (conn=0x560a80, query=0x2a9860a010 "yjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t10.47\t08620667742MDF1\t\tMCDONALD'S F7742\t5814\t1003\t1187954\nyjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t16.99\t00614111869AWG1\t\tPIGGLY WIGGLY#52\t5411"...) at fe-exec.c:687 #4 0x0000002a95580788 in PQexec (conn=0x560a80, query=0x2a9860a010 "yjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t10.47\t08620667742MDF1\t\tMCDONALD'S F7742\t5814\t1003\t1187954\nyjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t16.99\t00614111869AWG1\t\tPIGGLY WIGGLY#52\t5411"...) at fe-exec.c:1288 #5 0x0000000000406e95 in SendQuery ( query=0x2a9860a010 "yjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t10.47\t08620667742MDF1\t\tMCDONALD'S F7742\t5814\t1003\t1187954\nyjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t16.99\t00614111869AWG1\t\tPIGGLY WIGGLY#52\t5411"...) at common.c:878 #6 0x0000000000409dfa in MainLoop (source=0x5414f0) at mainloop.c:340 #7 0x000000000040408d in process_file (filename=0x541d70 "data/usb_t_60M.sql", single_txn=0 '\0') at command.c:1394 #8 0x00000000004057cd in exec_command (cmd=0x541bc0 "i", scan_state=0x549b30, query_buf=0x549890) at command.c:546 #9 0x0000000000406286 in HandleSlashCmds (scan_state=0x549b30, query_buf=0x549890) at command.c:92 #10 0x0000000000409f66 in MainLoop (source=0x541070) at mainloop.c:259 #11 0x000000000040408d in process_file (filename=0x560750 "bench-index-after.sql", single_txn=0 '\0') at command.c:1394 #12 0x00000000004057cd in exec_command ( cmd=0x569360 "Hxnh9aOwMviVVP9I=\t200803\t2008-03-10\t0\t15.54\t434240509886\t\tNU-YALE CLEANERS (BARD\t4225\t0\t11636644\n3MM9gwBGMHiyPGmUrqIEOOha1Ao=\t200803\t2008-03-11\t0\t739.95\t554172100100007\t\tUSAAP&C PREMIUM\t6300\t0\t183718"..., scan_state=0x549c20, query_buf=0x54a480) at command.c:546 #13 0x0000000000406286 in HandleSlashCmds (scan_state=0x549c20, query_buf=0x54a480) at command.c:92 #14 0x0000000000409f66 in MainLoop (source=0x3cc332fb00) at mainloop.c:259 #15 0x000000000040c687 in main (argc=80207, argv=0x0) at startup.c:367 #16 0x0000003cc311c3fb in __libc_start_main () from /lib64/tls/libc.so.6 #17 0x0000000000403d2a in _start () #18 0x0000007fbffff558 in ?? () #19 0x000000000000001c in ?? () #20 0x0000000000000001 in ?? () #21 0x0000007fbffff795 in ?? () #22 0x0000000000000000 in ?? ()
"Francisco Reyes" <lists@stringsutils.com> writes: > #1 0x0000002a955820ae in pqPutMsgBytes (buf=0x2a9860a010, len=2147483647, So it's trying to execute an sql query that's MAXINT bytes long which is pretty off-course. > #5 0x0000000000406e95 in SendQuery ( > query=0x2a9860a010 "yjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t10.47\t08620667742MDF1\t\tMCDONALD'S F7742\t5814\t1003\t1187954\nyjbjK8WKLRHKavptOqlPD4QeI24=\t200803\t2008-03-11\t0\t16.99\t00614111869AWG1\t\tPIGGLY WIGGLY#52\t5411"...) > at common.c:878 > #6 0x0000000000409dfa in MainLoop (source=0x5414f0) at mainloop.c:340 > #7 0x000000000040408d in process_file (filename=0x541d70 > "data/usb_t_60M.sql", single_txn=0 '\0') at command.c:1394 > #8 0x00000000004057cd in exec_command (cmd=0x541bc0 "i", > scan_state=0x549b30, query_buf=0x549890) at command.c:546 Could you post the first couple lines of data/usb_t_60M.sql ? Does it really have a COPY command at the beginning? Are you really doing \i data/usb_t_60M.sql or were you trying to do a copy from this file? > #9 0x0000000000406286 in HandleSlashCmds (scan_state=0x549b30, > query_buf=0x549890) at command.c:92 > #10 0x0000000000409f66 in MainLoop (source=0x541070) at mainloop.c:259 > #11 0x000000000040408d in process_file (filename=0x560750 > "bench-index-after.sql", single_txn=0 '\0') at command.c:1394 > #12 0x00000000004057cd in exec_command ( > cmd=0x569360 "Hxnh9aOwMviVVP9I=\t200803\t2008-03-10\t0\t15.54\t434240509886\t\tNU-YALE CLEANERS (BARD\t4225\t0\t11636644\n3MM9gwBGMHiyPGmUrqIEOOha1Ao=\t200803\t2008-03-11\t0\t739.95\t554172100100007\t\tUSAAP&C PREMIUM\t6300\t0\t183718"..., > scan_state=0x549c20, query_buf=0x54a480) at command.c:546 I'm a bit puzzled by this. The command being executed here seems to have been overwritten by data from the later command -- before it was overwritten it was apparently "\i bench-index-after.sql". It seems something with the PsqlScanState is not being sufficient to make the lexer completely reentrant here. What version of flex is this built with? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
On 3:09 pm 05/28/08 Gregory Stark <stark@enterprisedb.com> wrote: >Does it really have a COPY command at the beginning? Are you really doing >\i data/usb_t_60M.sql or were you trying to doa copy from this file? Argh..That's it. When I re-organized the scripts I must have taken the copy command from the top of the data file and did not put a 'copyfrom' in the calling script. Thanks!!! Is there a slowdown by keeping the debug RPM? Or should I only have it when trying to troubleshoot a problem?
On Wed, May 28, 2008 at 03:42:47PM -0400, Francisco Reyes wrote: > Is there a slowdown by keeping the debug RPM? > Or should I only have it when trying to troubleshoot a problem? Debug info costs nothing if its not being used. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
Attachment
"Francisco Reyes" <lists@stringsutils.com> writes: > On 3:09 pm 05/28/08 Gregory Stark <stark@enterprisedb.com> wrote: >> Does it really have a COPY command at the beginning? Are you really doing >\i data/usb_t_60M.sql or were you trying todo a copy from this file? > Argh..That's it. > When I re-organized the scripts I must have taken the copy command from the top of the data file and did not put a 'copyfrom' in the calling script. Hmm ... even so, it shouldn't have crashed, it should at worst have given you an out-of-memory error message. regards, tom lane
I wrote: > "Francisco Reyes" <lists@stringsutils.com> writes: >> On 3:09 pm 05/28/08 Gregory Stark <stark@enterprisedb.com> wrote: > Does it really have a COPY command at the beginning? Are you really doing >\i data/usb_t_60M.sql or were you trying todo a copy from this file? >> Argh..That's it. >> When I re-organized the scripts I must have taken the copy command from the top of the data file and did not put a 'copyfrom' in the calling script. > Hmm ... even so, it shouldn't have crashed, it should at worst have > given you an out-of-memory error message. Hmm, I see the problem: the pqexpbuffer code maxes out at INT_MAX bytes in the string buffer, which would be all right except that it has no good way to report the error and so the input is just getting truncated at that length. But then what happens is that pqPuts computes strlen(s) + 1, which is still all right because size_t is 64 bits on this machine, and passes that to pqPutMsgBytes, which computes conn->outMsgEnd + len *and smashes that to int*. So the value passed to pqCheckOutBufferSpace is negative, and the latter falsely concludes there's enough space in the buffer, and then the memcpy goes boom. A minimum solution would be to make pqCheckOutBufferSpace deal in size_t not int, but I have a feeling there are a lot of other similar gotchas when running this code on a 64-bit machine. We use int arithmetic an awful lot for stuff that probably should be size_t or ssize_t ... regards, tom lane
Tom Lane wrote: > I wrote: > > "Francisco Reyes" <lists@stringsutils.com> writes: > >> On 3:09 pm 05/28/08 Gregory Stark <stark@enterprisedb.com> wrote: > > Does it really have a COPY command at the beginning? Are you really doing >\i data/usb_t_60M.sql or were you trying todo a copy from this file? > > >> Argh..That's it. > >> When I re-organized the scripts I must have taken the copy command from the top of the data file and did not put a 'copyfrom' in the calling script. > > > Hmm ... even so, it shouldn't have crashed, it should at worst have > > given you an out-of-memory error message. > > Hmm, I see the problem: the pqexpbuffer code maxes out at INT_MAX bytes > in the string buffer, which would be all right except that it has no > good way to report the error and so the input is just getting truncated > at that length. But then what happens is that pqPuts computes > strlen(s) + 1, which is still all right because size_t is 64 bits > on this machine, and passes that to pqPutMsgBytes, which computes > conn->outMsgEnd + len *and smashes that to int*. So the value passed > to pqCheckOutBufferSpace is negative, and the latter falsely concludes > there's enough space in the buffer, and then the memcpy goes boom. > > A minimum solution would be to make pqCheckOutBufferSpace deal in > size_t not int, but I have a feeling there are a lot of other similar > gotchas when running this code on a 64-bit machine. We use int > arithmetic an awful lot for stuff that probably should be size_t > or ssize_t ... I assume this is not a TODO item. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +