Thread: Segmentation fault while COPY in 7.3
Help! Backend crashes on any COPY command I issue: PostgreSQL is 7.3 release, is compiled from sources. The message I get is: apb=# COPY maras2.mrk_yyn (bsvr_no, yyn_tur, yyn_no, yyn_syf, yyn_trh) TO stdout; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !# The output of strace on backend is following: # strace -p 18883 recv(7, "QCOPY maras2.mrk_yyn (bsvr_no, y"..., 8192, 0) = 77 gettimeofday({1038713111, 629131}, NULL) = 0 _llseek(11, 0, [0], SEEK_SET) = 0 read(11, "\0\0\0\0\20\0\0\0\1\0\0\0\24\0\360\37\360\37\1 b1\5\0\1"..., 8192) = 8192 read(11, "\0\0\0\0x#\375\"\20\0\0\0008\0h\35\360\37\1 \370\235\220"..., 8192) = 8192 _llseek(12, 0, [0], SEEK_SET) = 0 read(12, "\0\0\0\0\0#\375\"\20\0\0\0008\0@\34\0 \1 \200\237\0\1\0"..., 8192) = 8192 _llseek(21, 40960, [40960], SEEK_SET) = 0 read(21, "\0\0\0\0\4\251\3#\20\0\0\0004\4p\17\360\37\1 \340\237 "..., 8192) = 8192 _llseek(22, 106496, [106496], SEEK_SET) = 0 read(22, "\0\0\0\0(\326\377\"\20\0\0\0\10\1\200\1\0 \1 \200\237\0"..., 8192) = 8192 read(17, "\0\0\0\0\20\0\0\0\1\0\0\0\360\0\20\1\0 \1 p\237 \1\340"..., 8192) = 8192 open("/data/pgsql/base/2015749/2015757", O_RDWR|O_LARGEFILE) = 39 _llseek(39, 0, [20037632], SEEK_END) = 0 _llseek(39, 0, [0], SEEK_SET) = 0 read(39, "\0\0\0\0\270\7\4#\20\0\0\0\300\1\350\1\0 \1 \270\237\216"..., 8192) = 8192 --- SIGSEGV (Segmentation fault) --- +++ killed by SIGSEGV +++ # I have recompiled postgres with debug information enabled, run gdb, attached to backend process and caught SIGSEGV as following: (gdb) continue Continuing. Program received signal SIGSEGV, Segmentation fault. 0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 480 (*header->context->methods->free_p) (header->context, pointer); (gdb) The output of bt is: (gdb) bt #0 0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 #1 0x080b37be in CopyTo (rel=0x402be088, attnumlist=0x828a3f8, binary=0 '\0', oids=0 '\0', fp=0x0, delim=0x81c2a57 "\t", null_print=0x81b2c22 "\\N") at copy.c:671 #2 0x080b32f6 in DoCopy (stmt=0x827b398) at copy.c:491 #3 0x08118a6a in pg_exec_query_string (query_string=0x827af48, dest=Remote, parse_context=0x8245378) at postgres.c:789 #4 0x08119b89 in PostgresMain (argc=4, argv=0xbfffd2a0, username=0x8240a81 "postgres") at postgres.c:2016 #5 0x08101d7c in DoBackend (port=0x8240950) at postmaster.c:2293 #6 0x081016c2 in BackendStartup (port=0x8240950) at postmaster.c:1915 #7 0x08100875 in ServerLoop () at postmaster.c:1000 #8 0x08100326 in PostmasterMain (argc=1, argv=0x82270c0) at postmaster.c:779 #9 0x080debab in main (argc=1, argv=0xbfffdc34) at main.c:210 #10 0x42017589 in __libc_start_main () from /lib/i686/libc.so.6 (gdb) Thanks in advance, Nic
While waiting for help I decided to fix my problem by brute-forcing it. I commented out offending call to pfree() in src/backend/commands/copy.c at line 671. I may introduced a memory leak, but it works fine for me now. Best regards, Nic. *** ./src/backend/commands/copy.c.orig Sun Dec 1 06:02:34 2002 --- ./src/backend/commands/copy.c Sun Dec 1 06:02:48 2002 *************** *** 668,674 **** ObjectIdGetDatum(elements[attnum - 1]), Int32GetDatum(attr[attnum - 1]->atttypmod))); CopyAttributeOut(fp, string, delim); ! pfree(string); } else { --- 668,674 ---- ObjectIdGetDatum(elements[attnum - 1]), Int32GetDatum(attr[attnum - 1]->atttypmod))); CopyAttributeOut(fp, string, delim); ! /*pfree(string);*/ } else {
"Nicolai Tufar" <ntufar@apb.com.tr> writes: > (gdb) bt > #0 0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 > #1 0x080b37be in CopyTo (rel=0x402be088, attnumlist=0x828a3f8, binary=0 > '\0', oids=0 '\0', > fp=0x0, delim=0x81c2a57 "\t", null_print=0x81b2c22 "\\N") at copy.c:671 > #2 0x080b32f6 in DoCopy (stmt=0x827b398) at copy.c:491 Hm. Offhand it would seem that somebody's output function is returning a non-palloc'd string. What are the datatypes in that table, exactly? regards, tom lane
----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> > > Hm. Offhand it would seem that somebody's output function is returning > a non-palloc'd string. What are the datatypes in that table, exactly? apb=> \d maras2.mrk_yyn Table "maras2.mrk_yyn" Column | Type | Modifiers ---------+-----------------------+----------- bsvr_no | character varying(10) | yyn_tur | character varying(1) | yyn_no | numeric | yyn_syf | numeric | yyn_trh | date | apb=> > regards, tom lane Thanks for attention. Regards, Nic
Nicolai Tufar wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >>Hm. Offhand it would seem that somebody's output function is returning >>a non-palloc'd string. What are the datatypes in that table, exactly? > > apb=> \d maras2.mrk_yyn > Table "maras2.mrk_yyn" > Column | Type | Modifiers > ---------+-----------------------+----------- > bsvr_no | character varying(10) | > yyn_tur | character varying(1) | > yyn_no | numeric | > yyn_syf | numeric | > yyn_trh | date | FWIW, I just tried to repeat this on cvs tip and could not. Seems to work just fine: regression=# create table foo(bsvr_no character varying(10),yyn_tur character varying(1),yyn_no numeric,yyn_syf numeric,yyn_trh date); CREATE TABLE regression=# insert into foo values('a','a','1.0','1.0',now()::date); INSERT 556825 1 regression=# copy foo to stdout; a a 1.0 1.0 2002-11-30 regression=# copy foo(bsvr_no) to stdout; a regression=# copy foo(bsvr_no,yyn_tur,yyn_no,yyn_syf,yyn_trh) to stdout; a a 1.0 1.0 2002-11-30 regression=# insert into foo values('a','a','1.0','1.0',null); INSERT 556826 1 regression=# copy foo(bsvr_no,yyn_tur,yyn_no,yyn_syf,yyn_trh) to stdout; a a 1.0 1.0 2002-11-30 a a 1.0 1.0 \N Joe
Joe Conway <mail@joeconway.com> writes: > FWIW, I just tried to repeat this on cvs tip and could not. Seems to work just > fine: I have a sneaking suspicion that it can only be reproduced with Nicolai's locale setting. Not sure exactly what aspect of the locale is causing the problem though ... regards, tom lane
----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> > Joe Conway <mail@joeconway.com> writes: > > FWIW, I just tried to repeat this on cvs tip and could not. Seems to work just > > fine: > > I have a sneaking suspicion that it can only be reproduced with Nicolai's > locale setting. Not sure exactly what aspect of the locale is causing > the problem though ... I am trying to reproduce the same on another server with no success. Locale is what also comes to my mind. I have compiled PostgreSQL with --with-nls option. All databases, including template0 are in LATIN5 encoding. My postgresql.conf contains: LC_MESSAGES = 'tr_TR' LC_MONETARY = 'tr_TR' LC_NUMERIC = 'tr_TR' LC_TIME = 'tr_TR' datestyle = 'European' Regards, Nic. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Don't 'kill -9' the postmaster >
----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> > > Can you extract more info? For example, it would be useful to see > exactly which line of the table the COPY dies at, and what's in that > row. COPY dies at the first row and the data is as following: apb=> select * from mrk_yyn limit 3; bsvr_no | yyn_tur | yyn_no | yyn_syf | yyn_trh ---------+---------+--------+---------+--------- 045030 | T | 176 | 39 | 045033 | T | 176 | 39 | 045035 | T | 176 | 39 | (3 rows) apb=> yyn_trh semm to be null. I tried to COPY other tables, the ones without date columns, I created new tables, but it still segfaults. And I can not reproduce it on other system. > > regards, tom lane > Nic
Nicolai Tufar wrote: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >>I have a sneaking suspicion that it can only be reproduced with Nicolai's >>locale setting. Not sure exactly what aspect of the locale is causing >>the problem though ... > > I am trying to reproduce the same on another server with no success. > Locale is what also comes to my mind. I have compiled PostgreSQL > with --with-nls option. All databases, including template0 are in > LATIN5 encoding. My postgresql.conf contains: > > LC_MESSAGES = 'tr_TR' > LC_MONETARY = 'tr_TR' > LC_NUMERIC = 'tr_TR' > LC_TIME = 'tr_TR' > datestyle = 'European' > Here's what I have: $ pg_config --configure '--with-tcl' '--enable-integer-datetimes' '--enable-debug' '--enable-cassert' '--enable-nls' '--enable-depend' 'DOCBOOKSTYLE=/usr/share/sgml/docbook/dsssl-stylesheets-1.76' testl5=# select * from pg_settings where name like 'lc%' or name = 'DateStyle'; name | setting -------------+------------------------------- DateStyle | ISO with European conventions lc_messages | tr_TR lc_monetary | tr_TR lc_numeric | tr_TR lc_time | tr_TR (5 rows) testl5=# \l List of databases Name | Owner | Encoding ------------+----------+----------- parts | postgres | SQL_ASCII regression | postgres | SQL_ASCII template0 | postgres | SQL_ASCII template1 | postgres | SQL_ASCII test | postgres | SQL_ASCII testjp | postgres | EUC_JP testl5 | postgres | LATIN5 (7 rows) testl5=# copy foo(bsvr_no,yyn_tur,yyn_no,yyn_syf,yyn_trh) to stdout; a a 1.0 1.0 2002-11-30 Still works for me. Could it be related to --enable-integer-datetimes? Joe
Ohhh ... Nicolai, are you running with a client encoding different from server encoding? I'm thinking that the real problem is that pg_server_to_client() can return its input pointer as its result, meaning that CopyAttributeOut might try to pfree() its input. regards, tom lane
----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> > Ohhh ... > > Nicolai, are you running with a client encoding different from server > encoding? Got it! Gentlemen, thank you very much for assistance. The body of evidence was slowly growing, then, finaly Tom Lan's message have enlightened me. It all started with initdb's warning that it can not set client's connection to 'LATIN5'. Okay, I said, maybe some system files are not installed. Then I tried to create a database with ENCODING='LATIN5'. I did not pay much attention either since my template1 and template0 are in LATIN5. Then on Tom's suggestion I tried to change client encoding: aaa=# \encoding LATIN5 aaa=# \encoding SQL_ASCII SQL_ASCII: invalid encoding name or conversion procedure not found aaa=# \encoding LATIN5 LATIN5: invalid encoding name or conversion procedure not found aaa=# \encoding SQL_ASCii aaa=# \encoding SQL_ASCII aaa=# \encoding LATiN5 aaa=# \encoding LATIN5 aaa=# So it all falls back to Turkish "I" problem I mentioned earlier. To perform COPY successfully I had to set client locale to 'LATiN5' (note quotes and lower-case "i"). Any other combinations result in segmentation fault. Now, would it be right to change locale name handling to use ASCII-only lower and upper-case conversions? Many thanks to Tom Lane and Joe Conway. I've got to get some sleep :-) Regards, Nic.
"Nicolai Tufar" <ntufar@apb.com.tr> writes: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> Ohhh ... >> >> Nicolai, are you running with a client encoding different from server >> encoding? > Got it! Okay, the problem is a double free of memory in COPY OUT. I've applied the following patch for 7.3.1. (A better fix would be for COPY OUT to run in a local memory context that it could reset every so often, like once per row, and then we could remove these risky retail pfree's entirely. I plan to do that instead in development sources.) regards, tom lane *** src/backend/commands/copy.c.orig Fri Oct 18 20:25:36 2002 --- src/backend/commands/copy.c Sun Dec 1 12:28:24 2002 *************** *** 1470,1478 **** char *string; char c; char delimc = delim[0]; - bool same_encoding; ! char *string_start; int mblen; int i; --- 1470,1477 ---- char *string; char c; char delimc = delim[0]; bool same_encoding; ! char *string_start = NULL; int mblen; int i; *************** *** 1481,1492 **** { string = (char *) pg_server_to_client((unsigned char *) server_string, strlen(server_string)); ! string_start = string; } else { string = server_string; - string_start = NULL; } for (; (c = *string) != '\0'; string += mblen) --- 1480,1491 ---- { string = (char *) pg_server_to_client((unsigned char *) server_string, strlen(server_string)); ! if (string != server_string) ! string_start = string; } else { string = server_string; } for (; (c = *string) != '\0'; string += mblen)
Something is funny about this pfree()....here is a similar crash on a 7.2.3 (gdb) #0 0x00420874 in heap_freetuple () #1 0x004a8390 in acquire_sample_rows () #2 0x004a75c8 in analyze_rel () #3 0x0049f690 in vacuum () #4 0x005585d8 in ProcessUtility () #5 0x00553c78 in pg_exec_query_string () #6 0x00555850 in PostgresMain () #7 0x00524660 in DoBackend () #8 0x00523d08 in BackendStartup () #9 0x00521c18 in ServerLoop () #10 0x005211c8 in PostmasterMain () #11 0x004df3d8 in main () #12 0x2ae34928 in __libc_start_main () from /lib/libc.so.6 (gdb) Nicolai Tufar wrote: >Help! >Backend crashes on any COPY command I issue: >PostgreSQL is 7.3 release, is compiled from sources. >The message I get is: > >apb=# COPY maras2.mrk_yyn (bsvr_no, yyn_tur, yyn_no, yyn_syf, yyn_trh) TO >stdout; >server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. >The connection to the server was lost. Attempting reset: Failed. >!# > > >The output of strace on backend is following: > ># strace -p 18883 >recv(7, "QCOPY maras2.mrk_yyn (bsvr_no, y"..., 8192, 0) = 77 >gettimeofday({1038713111, 629131}, NULL) = 0 >_llseek(11, 0, [0], SEEK_SET) = 0 >read(11, "\0\0\0\0\20\0\0\0\1\0\0\0\24\0\360\37\360\37\1 b1\5\0\1"..., 8192) >= 8192 >read(11, "\0\0\0\0x#\375\"\20\0\0\0008\0h\35\360\37\1 \370\235\220"..., >8192) = 8192 >_llseek(12, 0, [0], SEEK_SET) = 0 >read(12, "\0\0\0\0\0#\375\"\20\0\0\0008\0@\34\0 \1 \200\237\0\1\0"..., 8192) >= 8192 >_llseek(21, 40960, [40960], SEEK_SET) = 0 >read(21, "\0\0\0\0\4\251\3#\20\0\0\0004\4p\17\360\37\1 \340\237 "..., 8192) >= 8192 >_llseek(22, 106496, [106496], SEEK_SET) = 0 >read(22, "\0\0\0\0(\326\377\"\20\0\0\0\10\1\200\1\0 \1 \200\237\0"..., 8192) >= 8192 >read(17, "\0\0\0\0\20\0\0\0\1\0\0\0\360\0\20\1\0 \1 p\237 \1\340"..., 8192) >= 8192 >open("/data/pgsql/base/2015749/2015757", O_RDWR|O_LARGEFILE) = 39 >_llseek(39, 0, [20037632], SEEK_END) = 0 >_llseek(39, 0, [0], SEEK_SET) = 0 >read(39, "\0\0\0\0\270\7\4#\20\0\0\0\300\1\350\1\0 \1 \270\237\216"..., >8192) = 8192 >--- SIGSEGV (Segmentation fault) --- >+++ killed by SIGSEGV +++ ># > >I have recompiled postgres with debug information enabled, run gdb, attached >to backend process >and caught SIGSEGV as following: > >(gdb) continue >Continuing. > >Program received signal SIGSEGV, Segmentation fault. >0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 >480 (*header->context->methods->free_p) (header->context, >pointer); >(gdb) > > >The output of bt is: > >(gdb) bt >#0 0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 >#1 0x080b37be in CopyTo (rel=0x402be088, attnumlist=0x828a3f8, binary=0 >'\0', oids=0 '\0', > fp=0x0, delim=0x81c2a57 "\t", null_print=0x81b2c22 "\\N") at copy.c:671 >#2 0x080b32f6 in DoCopy (stmt=0x827b398) at copy.c:491 >#3 0x08118a6a in pg_exec_query_string (query_string=0x827af48, dest=Remote, > parse_context=0x8245378) at postgres.c:789 >#4 0x08119b89 in PostgresMain (argc=4, argv=0xbfffd2a0, username=0x8240a81 >"postgres") > at postgres.c:2016 >#5 0x08101d7c in DoBackend (port=0x8240950) at postmaster.c:2293 >#6 0x081016c2 in BackendStartup (port=0x8240950) at postmaster.c:1915 >#7 0x08100875 in ServerLoop () at postmaster.c:1000 >#8 0x08100326 in PostmasterMain (argc=1, argv=0x82270c0) at >postmaster.c:779 >#9 0x080debab in main (argc=1, argv=0xbfffdc34) at main.c:210 >#10 0x42017589 in __libc_start_main () from /lib/i686/libc.so.6 >(gdb) > > > >Thanks in advance, >Nic > > >---------------------------(end of broadcast)--------------------------- >TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > >
Yes, we found a double pfree in 7.3. There will be a fix in 7.3.1. --------------------------------------------------------------------------- Medi Montaseri wrote: > Something is funny about this pfree()....here is a similar crash on a 7.2.3 > > (gdb) > #0 0x00420874 in heap_freetuple () > #1 0x004a8390 in acquire_sample_rows () > #2 0x004a75c8 in analyze_rel () > #3 0x0049f690 in vacuum () > #4 0x005585d8 in ProcessUtility () > #5 0x00553c78 in pg_exec_query_string () > #6 0x00555850 in PostgresMain () > #7 0x00524660 in DoBackend () > #8 0x00523d08 in BackendStartup () > #9 0x00521c18 in ServerLoop () > #10 0x005211c8 in PostmasterMain () > #11 0x004df3d8 in main () > #12 0x2ae34928 in __libc_start_main () from /lib/libc.so.6 > (gdb) > > > > Nicolai Tufar wrote: > > >Help! > >Backend crashes on any COPY command I issue: > >PostgreSQL is 7.3 release, is compiled from sources. > >The message I get is: > > > >apb=# COPY maras2.mrk_yyn (bsvr_no, yyn_tur, yyn_no, yyn_syf, yyn_trh) TO > >stdout; > >server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > >The connection to the server was lost. Attempting reset: Failed. > >!# > > > > > >The output of strace on backend is following: > > > ># strace -p 18883 > >recv(7, "QCOPY maras2.mrk_yyn (bsvr_no, y"..., 8192, 0) = 77 > >gettimeofday({1038713111, 629131}, NULL) = 0 > >_llseek(11, 0, [0], SEEK_SET) = 0 > >read(11, "\0\0\0\0\20\0\0\0\1\0\0\0\24\0\360\37\360\37\1 b1\5\0\1"..., 8192) > >= 8192 > >read(11, "\0\0\0\0x#\375\"\20\0\0\0008\0h\35\360\37\1 \370\235\220"..., > >8192) = 8192 > >_llseek(12, 0, [0], SEEK_SET) = 0 > >read(12, "\0\0\0\0\0#\375\"\20\0\0\0008\0@\34\0 \1 \200\237\0\1\0"..., 8192) > >= 8192 > >_llseek(21, 40960, [40960], SEEK_SET) = 0 > >read(21, "\0\0\0\0\4\251\3#\20\0\0\0004\4p\17\360\37\1 \340\237 "..., 8192) > >= 8192 > >_llseek(22, 106496, [106496], SEEK_SET) = 0 > >read(22, "\0\0\0\0(\326\377\"\20\0\0\0\10\1\200\1\0 \1 \200\237\0"..., 8192) > >= 8192 > >read(17, "\0\0\0\0\20\0\0\0\1\0\0\0\360\0\20\1\0 \1 p\237 \1\340"..., 8192) > >= 8192 > >open("/data/pgsql/base/2015749/2015757", O_RDWR|O_LARGEFILE) = 39 > >_llseek(39, 0, [20037632], SEEK_END) = 0 > >_llseek(39, 0, [0], SEEK_SET) = 0 > >read(39, "\0\0\0\0\270\7\4#\20\0\0\0\300\1\350\1\0 \1 \270\237\216"..., > >8192) = 8192 > >--- SIGSEGV (Segmentation fault) --- > >+++ killed by SIGSEGV +++ > ># > > > >I have recompiled postgres with debug information enabled, run gdb, attached > >to backend process > >and caught SIGSEGV as following: > > > >(gdb) continue > >Continuing. > > > >Program received signal SIGSEGV, Segmentation fault. > >0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 > >480 (*header->context->methods->free_p) (header->context, > >pointer); > >(gdb) > > > > > >The output of bt is: > > > >(gdb) bt > >#0 0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 > >#1 0x080b37be in CopyTo (rel=0x402be088, attnumlist=0x828a3f8, binary=0 > >'\0', oids=0 '\0', > > fp=0x0, delim=0x81c2a57 "\t", null_print=0x81b2c22 "\\N") at copy.c:671 > >#2 0x080b32f6 in DoCopy (stmt=0x827b398) at copy.c:491 > >#3 0x08118a6a in pg_exec_query_string (query_string=0x827af48, dest=Remote, > > parse_context=0x8245378) at postgres.c:789 > >#4 0x08119b89 in PostgresMain (argc=4, argv=0xbfffd2a0, username=0x8240a81 > >"postgres") > > at postgres.c:2016 > >#5 0x08101d7c in DoBackend (port=0x8240950) at postmaster.c:2293 > >#6 0x081016c2 in BackendStartup (port=0x8240950) at postmaster.c:1915 > >#7 0x08100875 in ServerLoop () at postmaster.c:1000 > >#8 0x08100326 in PostmasterMain (argc=1, argv=0x82270c0) at > >postmaster.c:779 > >#9 0x080debab in main (argc=1, argv=0xbfffdc34) at main.c:210 > >#10 0x42017589 in __libc_start_main () from /lib/i686/libc.so.6 > >(gdb) > > > > > > > >Thanks in advance, > >Nic > > > > > >---------------------------(end of broadcast)--------------------------- > >TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > > > > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Yes, we found a double pfree in 7.3. There will be a fix in 7.3.1. The double pfree was in COPY, though. This looks to be a different issue. >> (gdb) >> #0 0x00420874 in heap_freetuple () >> #1 0x004a8390 in acquire_sample_rows () >> #2 0x004a75c8 in analyze_rel () >> #3 0x0049f690 in vacuum () >> #4 0x005585d8 in ProcessUtility () >> #5 0x00553c78 in pg_exec_query_string () >> #6 0x00555850 in PostgresMain () regards, tom lane
Ewe, yea. --------------------------------------------------------------------------- Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Yes, we found a double pfree in 7.3. There will be a fix in 7.3.1. > > The double pfree was in COPY, though. This looks to be a different > issue. > > >> (gdb) > >> #0 0x00420874 in heap_freetuple () > >> #1 0x004a8390 in acquire_sample_rows () > >> #2 0x004a75c8 in analyze_rel () > >> #3 0x0049f690 in vacuum () > >> #4 0x005585d8 in ProcessUtility () > >> #5 0x00553c78 in pg_exec_query_string () > >> #6 0x00555850 in PostgresMain () > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Can I have a patch and would it work against 7.2.3..... Thanks Bruce Momjian wrote: >Yes, we found a double pfree in 7.3. There will be a fix in 7.3.1. > >--------------------------------------------------------------------------- > >Medi Montaseri wrote: > > >>Something is funny about this pfree()....here is a similar crash on a 7.2.3 >> >>(gdb) >>#0 0x00420874 in heap_freetuple () >>#1 0x004a8390 in acquire_sample_rows () >>#2 0x004a75c8 in analyze_rel () >>#3 0x0049f690 in vacuum () >>#4 0x005585d8 in ProcessUtility () >>#5 0x00553c78 in pg_exec_query_string () >>#6 0x00555850 in PostgresMain () >>#7 0x00524660 in DoBackend () >>#8 0x00523d08 in BackendStartup () >>#9 0x00521c18 in ServerLoop () >>#10 0x005211c8 in PostmasterMain () >>#11 0x004df3d8 in main () >>#12 0x2ae34928 in __libc_start_main () from /lib/libc.so.6 >>(gdb) >> >> >> >>Nicolai Tufar wrote: >> >> >> >>>Help! >>>Backend crashes on any COPY command I issue: >>>PostgreSQL is 7.3 release, is compiled from sources. >>>The message I get is: >>> >>>apb=# COPY maras2.mrk_yyn (bsvr_no, yyn_tur, yyn_no, yyn_syf, yyn_trh) TO >>>stdout; >>>server closed the connection unexpectedly >>> This probably means the server terminated abnormally >>> before or while processing the request. >>>The connection to the server was lost. Attempting reset: Failed. >>>!# >>> >>> >>>The output of strace on backend is following: >>> >>># strace -p 18883 >>>recv(7, "QCOPY maras2.mrk_yyn (bsvr_no, y"..., 8192, 0) = 77 >>>gettimeofday({1038713111, 629131}, NULL) = 0 >>>_llseek(11, 0, [0], SEEK_SET) = 0 >>>read(11, "\0\0\0\0\20\0\0\0\1\0\0\0\24\0\360\37\360\37\1 b1\5\0\1"..., 8192) >>>= 8192 >>>read(11, "\0\0\0\0x#\375\"\20\0\0\0008\0h\35\360\37\1 \370\235\220"..., >>>8192) = 8192 >>>_llseek(12, 0, [0], SEEK_SET) = 0 >>>read(12, "\0\0\0\0\0#\375\"\20\0\0\0008\0@\34\0 \1 \200\237\0\1\0"..., 8192) >>>= 8192 >>>_llseek(21, 40960, [40960], SEEK_SET) = 0 >>>read(21, "\0\0\0\0\4\251\3#\20\0\0\0004\4p\17\360\37\1 \340\237 "..., 8192) >>>= 8192 >>>_llseek(22, 106496, [106496], SEEK_SET) = 0 >>>read(22, "\0\0\0\0(\326\377\"\20\0\0\0\10\1\200\1\0 \1 \200\237\0"..., 8192) >>>= 8192 >>>read(17, "\0\0\0\0\20\0\0\0\1\0\0\0\360\0\20\1\0 \1 p\237 \1\340"..., 8192) >>>= 8192 >>>open("/data/pgsql/base/2015749/2015757", O_RDWR|O_LARGEFILE) = 39 >>>_llseek(39, 0, [20037632], SEEK_END) = 0 >>>_llseek(39, 0, [0], SEEK_SET) = 0 >>>read(39, "\0\0\0\0\270\7\4#\20\0\0\0\300\1\350\1\0 \1 \270\237\216"..., >>>8192) = 8192 >>>--- SIGSEGV (Segmentation fault) --- >>>+++ killed by SIGSEGV +++ >>># >>> >>>I have recompiled postgres with debug information enabled, run gdb, attached >>>to backend process >>>and caught SIGSEGV as following: >>> >>>(gdb) continue >>>Continuing. >>> >>>Program received signal SIGSEGV, Segmentation fault. >>>0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 >>>480 (*header->context->methods->free_p) (header->context, >>>pointer); >>>(gdb) >>> >>> >>>The output of bt is: >>> >>>(gdb) bt >>>#0 0x081697bf in pfree (pointer=0x828a7b0) at mcxt.c:480 >>>#1 0x080b37be in CopyTo (rel=0x402be088, attnumlist=0x828a3f8, binary=0 >>>'\0', oids=0 '\0', >>> fp=0x0, delim=0x81c2a57 "\t", null_print=0x81b2c22 "\\N") at copy.c:671 >>>#2 0x080b32f6 in DoCopy (stmt=0x827b398) at copy.c:491 >>>#3 0x08118a6a in pg_exec_query_string (query_string=0x827af48, dest=Remote, >>> parse_context=0x8245378) at postgres.c:789 >>>#4 0x08119b89 in PostgresMain (argc=4, argv=0xbfffd2a0, username=0x8240a81 >>>"postgres") >>> at postgres.c:2016 >>>#5 0x08101d7c in DoBackend (port=0x8240950) at postmaster.c:2293 >>>#6 0x081016c2 in BackendStartup (port=0x8240950) at postmaster.c:1915 >>>#7 0x08100875 in ServerLoop () at postmaster.c:1000 >>>#8 0x08100326 in PostmasterMain (argc=1, argv=0x82270c0) at >>>postmaster.c:779 >>>#9 0x080debab in main (argc=1, argv=0xbfffdc34) at main.c:210 >>>#10 0x42017589 in __libc_start_main () from /lib/i686/libc.so.6 >>>(gdb) >>> >>> >>> >>>Thanks in advance, >>>Nic >>> >>> >>>---------------------------(end of broadcast)--------------------------- >>>TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >>> >>> >>> >>> >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 5: Have you checked our extensive FAQ? >> >>http://www.postgresql.org/users-lounge/docs/faq.html >> >> >> > > >
FYI, I'm experiencing this with Async Queries (Async Purge and Aysnc Vacuum). Tom Lane wrote: >Bruce Momjian <pgman@candle.pha.pa.us> writes: > > >>Yes, we found a double pfree in 7.3. There will be a fix in 7.3.1. >> >> > >The double pfree was in COPY, though. This looks to be a different >issue. > > > >>>(gdb) >>>#0 0x00420874 in heap_freetuple () >>>#1 0x004a8390 in acquire_sample_rows () >>>#2 0x004a75c8 in analyze_rel () >>>#3 0x0049f690 in vacuum () >>>#4 0x005585d8 in ProcessUtility () >>>#5 0x00553c78 in pg_exec_query_string () >>>#6 0x00555850 in PostgresMain () >>> >>> > > regards, tom lane > >
I was wrong. We found a COPY bug in 7.3 that will be fixed in 7.3.1. Would you compile with symbols, -g, and send a new backtrace? --------------------------------------------------------------------------- Medi Montaseri wrote: > FYI, I'm experiencing this with Async Queries (Async Purge and Aysnc > Vacuum). > > Tom Lane wrote: > > >Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > > > >>Yes, we found a double pfree in 7.3. There will be a fix in 7.3.1. > >> > >> > > > >The double pfree was in COPY, though. This looks to be a different > >issue. > > > > > > > >>>(gdb) > >>>#0 0x00420874 in heap_freetuple () > >>>#1 0x004a8390 in acquire_sample_rows () > >>>#2 0x004a75c8 in analyze_rel () > >>>#3 0x0049f690 in vacuum () > >>>#4 0x005585d8 in ProcessUtility () > >>>#5 0x00553c78 in pg_exec_query_string () > >>>#6 0x00555850 in PostgresMain () > >>> > >>> > > > > regards, tom lane > > > > > > > > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
I encountered a reproducable backend crash in the following situation: - while running a conversion tool written in TrollTech's Qt that uses libpq - apart from the tool, only one inactive psql-client is connected to the same database (for extra monitoring) - during the conversion session this tool creates a number of temporary tables The backend crashes when at the end of the conversion a 'vacuum full analyse' is started, more specifically it dies upon doing work on one of these temporary tables.... Obviously, no problem is encountered when vacuuming *after* the tool has finished. Is this known behaviour? Otherwise, I'd be glad to assist with a backtrace. Regards, Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > Is this known behaviour? Nope. > Otherwise, I'd be glad to assist with a backtrace. Please. regards, tom lane
> Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > > Is this known behaviour? > TL> Nope. > > > Otherwise, I'd be glad to assist with a backtrace. > TL> Please. ** PostgreSQL v7.3 configured with: ./configure --prefix=/usr/src/postgresql_debug/pgsql --enable-debug --with-openssl --with-perl --with-python --with-tcl --with-pgport=5433 ** while using a client written in TrollTech's QT v3.1 that creates a number of temporary tables, uses a number lot of 'copy' statements and at the end tries to 'vacuum full analyse' the database ** the backend crashes.... ;( ** last information received by the frontend: INFO: --Relation pg_temp_2.t_article_details-- Warning: QSqlDatabaseManager unable to open database: FATAL: The database system is in recovery mode ** backend's log: LOG: query: update stock_pick set amount = case when pallet_amount > amount_stock then amount_stock else pallet_amount end from t_stock where stock_pick.article_id = t_stock.id LOG: recycled transaction log file 0000000000000017 LOG: query: update t_stock set amount_stock = case when pallet_amount > amount_stock then 0 else amount_stock - pallet_amount end where exists (select 1 from stock_pick where article_id = t_stock.id) LOG: query: insert into stock_bulk select 5, id, amount_stock, 0, 1, false, null from t_stock LOG: query: SELECT 1 FROM ONLY "public"."location" x WHERE "id" = $1 FOR UPDATE OF x LOG: query: SELECT 1 FROM ONLY "public"."article" x WHERE "id" = $1 FOR UPDATE OF x LOG: query: SELECT 1 FROM ONLY "public"."pallet_type" x WHERE "id" = $1 FOR UPDATE OF x LOG: query: truncate table t_stock LOG: query: drop table aux_country, aux_language, aux_purchaser, aux_salesperson, aux_valuta, aux_address LOG: query: drop index temp01, temp02, temp03 LOG: query: vacuum full verbose analyse LOG: recycled transaction log file 000000000000001A LOG: recycled transaction log file 0000000000000018 LOG: recycled transaction log file 0000000000000019 LOG: server process (pid 32663) was terminated by signal 11 LOG: terminating any other active server processes WARNING: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormally and possibly corrupted shared memory. I have rolled back the current transaction and am going to terminate your database system connection and exit. Please reconnect to the database system and repeat your query. LOG: connection received: host=192.168.0.79 port=33762 FATAL: The database system is in recovery mode LOG: all server processes terminated; reinitializing shared memory and semaphores LOG: database system was interrupted at 2002-12-05 12:22:50 CET LOG: checkpoint record is at 0/1DAEA234 LOG: redo record is at 0/1D8D2C10; undo record is at 0/0; shutdown FALSE LOG: next transaction id: 5145; next oid: 451153 LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/1D8D2C10 LOG: ReadRecord: record with zero length at 0/1DB74550 LOG: redo done at 0/1DB7452C LOG: recycled transaction log file 000000000000001B LOG: recycled transaction log file 000000000000001C LOG: database system is ready ** backtrace of the coredump: #0 _mdfd_getrelnfd (reln=0x0) at md.c:854 854 fd = RelationGetFile(reln); (gdb) bt #0 _mdfd_getrelnfd (reln=0x0) at md.c:854 #1 0x8123a0f in _mdfd_getseg (reln=0x0, blkno=13) at md.c:878 #2 0x812349d in mdwrite (reln=0x0, blocknum=13, buffer=0x848dde8 "") at md.c:461 #3 0x8123e31 in smgrwrite (which=0, reln=0x0, blocknum=13, buffer=0x848dde8 "") at smgr.c:328 #4 0x811be2a in LocalBufferAlloc (reln=0x413979d0, blockNum=0, foundPtr=0xbfffebb7 "¿Ìëÿ¿@§\021\bÐy9A") at localbuf.c:100 #5 0x811a7c2 in ReadBufferInternal (reln=0x413979d0, blockNum=0, bufferLockHeld=0 '\000') at bufmgr.c:144 #6 0x811a740 in ReadBuffer (reln=0x413979d0, blockNum=0) at bufmgr.c:112 #7 0x80d7d19 in scan_heap (vacrelstats=0x8389430, onerel=0x413979d0, vacuum_pages=0xbfffed50, fraged_pages=0xbfffed40) at vacuum.c:1069 #8 0x80d7a79 in full_vacuum_rel (onerel=0x413979d0, vacstmt=0x8313858) at vacuum.c:910 #9 0x80d799e in vacuum_rel (relid=374498, vacstmt=0x8313858, expected_relkind=114 'r') at vacuum.c:827 #10 0x80d73a3 in vacuum (vacstmt=0x8313858) at vacuum.c:290 #11 0x8126ff6 in ProcessUtility (parsetree=0x8313858, dest=Remote, completionTag=0xbfffee30 "") at utility.c:713 #12 0x8124dcc in pg_exec_query_string (query_string=0x8313688, dest=Remote, parse_context=0x82ddb70) at postgres.c:789 #13 0x8125df7 in PostgresMain (argc=4, argv=0xbffff058, username=0x82d4a81 "postgres") at postgres.c:2016 #14 0x810eab9 in DoBackend (port=0x82d4950) at postmaster.c:2293 #15 0x810e467 in BackendStartup (port=0x82d4950) at postmaster.c:1915 #16 0x810d7d4 in ServerLoop () at postmaster.c:1018 #17 0x810d331 in PostmasterMain (argc=4, argv=0x82bce00) at postmaster.c:779 #18 0x80ee909 in main (argc=4, argv=0xbffff974) at main.c:210 #19 0x400d4577 in __libc_start_main () from /lib/libc.so.6 (gdb) If you need more details on machine/tables/etc. just let me know. -> the backend doesn't crash if the vacuum is done through psql AFTER the clients work is done (and the temporary tables are automatically destroyed) Regards, Frank.
Frank van Vugt <ftm.van.vugt@foxi.nl> writes: > ** while using a client written in TrollTech's QT v3.1 that creates a number > of temporary tables, uses a number lot of 'copy' statements and at the end > tries to 'vacuum full analyse' the database Drat --- looks like the local buffer manager code has gotten out of sync with the relcache code. In particular I note the following in localbuf.c Relation bufrel = RelationNodeCacheGetRelation(bufHdr->tag.rnode); /* * The relcache is not supposed to throw away temp rels, so this * should always succeed. */ Assert(bufrel != NULL); which was true at the time it was written, but is so no longer :-( Will fix. regards, tom lane
I said: > Frank van Vugt <ftm.van.vugt@foxi.nl> writes: >> ** while using a client written in TrollTech's QT v3.1 that creates a number >> of temporary tables, uses a number lot of 'copy' statements and at the end >> tries to 'vacuum full analyse' the database > Drat --- looks like the local buffer manager code has gotten out of sync > with the relcache code. Here is the fix if you need to patch this locally. regards, tom lane *** src/backend/storage/buffer/localbuf.c.orig Wed Sep 4 16:31:25 2002 --- src/backend/storage/buffer/localbuf.c Thu Dec 5 17:48:10 2002 *************** *** 90,108 **** { Relation bufrel = RelationNodeCacheGetRelation(bufHdr->tag.rnode); - /* - * The relcache is not supposed to throw away temp rels, so this - * should always succeed. - */ - Assert(bufrel != NULL); - /* flush this page */ ! smgrwrite(DEFAULT_SMGR, bufrel, bufHdr->tag.blockNum, ! (char *) MAKE_PTR(bufHdr->data)); ! LocalBufferFlushCount++; ! /* drop refcount incremented by RelationNodeCacheGetRelation */ ! RelationDecrementReferenceCount(bufrel); } /* --- 90,113 ---- { Relation bufrel = RelationNodeCacheGetRelation(bufHdr->tag.rnode); /* flush this page */ ! if (bufrel == (Relation) NULL) ! { ! smgrblindwrt(DEFAULT_SMGR, ! bufHdr->tag.rnode, ! bufHdr->tag.blockNum, ! (char *) MAKE_PTR(bufHdr->data)); ! } ! else ! { ! smgrwrite(DEFAULT_SMGR, bufrel, ! bufHdr->tag.blockNum, ! (char *) MAKE_PTR(bufHdr->data)); ! /* drop refcount incremented by RelationNodeCacheGetRelation */ ! RelationDecrementReferenceCount(bufrel); ! } ! LocalBufferFlushCount++; } /*