Thread: Re: [GENERAL] Still big problems with pg_dump!
-hackers removed. On Tue, Sep 17, 2002 at 10:11:41AM +0200, Wim wrote: > ERROR: AllocSetFree: cannot find block containing chunk 4c5ad0 This is definitely some sort of disk problem. Either you've written bad data to the disk in some way, or else the disk is corrupted or damaged. If it is a hardware problem, the obvious suspects are memory (I'd discount this idea unless everything else doesn't check out), a disk failure, or a controller failure. It could be OS related as well. Several of the 2.4 Linux kernel series, for instance, had roblems with massive filesystem corruption. > Some people suggest a drive failure, but I checked that and found no > problems... How did you check? > I must say that one of the table contains more than 3.000.000 rows, > another more than 1.400.000... When is your most recent backup? If you can't pg_dump, you will be needing that backup. > I must say that I had this problem a few months before, I got some help > then, but that couldn't solve my problem, > I recreated the database from scratch and copied the data, to fix thing > quickly. Thing went well for about two months :-( So you re-installed the data set on a machine that had somehow failed, you don't know why, and hoped that the problem would solve itself? Uh, that wasn't a good idea. In the future, if you have a problem which people suggest might be, for instance, a bad disk, it'd be a _very good_ idea to figure out precisely what the problem is before relying on the identical hardware again. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Tue, Sep 17, 2002 at 04:25:45PM +0200, Wim wrote: > > > >>ERROR: AllocSetFree: cannot find block containing chunk 4c5ad0 > >> > >> > > > >This is definitely some sort of disk problem. Either you've written > >bad data to the disk in some way, or else the disk is corrupted or > >damaged. > > > Postgres is running on solaris 8... > It is the same database as previous time that has the problem, but not > the same table. Someone else suggested that this would not be the error when you have written bad data to the disk (I thought you could have this if the controller was flakey and wrote bad data in the past. Maybe I'm wrong. Probably). > >How did you check? > > > > > with fsck. That won't help you if the controller is coming and going; you might find that it works one time, and not another. Indeed, a disk on its way out can even pass fsck sometimes, although it's pretty unusual. > Have backup... I can still SQL COPY to a text file, so that's no problem > so far. Well, that's good. I'd suggest backing up _really often_ until you know what the problem is, especially since this is production. > I know, I don't have much spare hardware, and the database had to work > quickly, it was the > only solution then. > Checked the disk, reinstalled the OS and still waiting for a CPU and > memory upgrade. Do you have another place to store the database in the meantime -- an Intel box with a cheap disk, or anything? At least you'd have another copy of the database somewhere that way. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Andrew Sullivan wrote: <snip>... >>>How did you check? >>> >>> >>> >>> >>with fsck. >> >> > >That won't help you if the controller is coming and going; you might >find that it works one time, and not another. Indeed, a disk on its >way out can even pass fsck sometimes, although it's pretty unusual. > The DB is located on a RAID5 disk array... > > >>Have backup... I can still SQL COPY to a text file, so that's no problem >>so far. >> >> > >Well, that's good. I'd suggest backing up _really often_ until you >know what the problem is, especially since this is production. > > > >>I know, I don't have much spare hardware, and the database had to work >>quickly, it was the >>only solution then. >>Checked the disk, reinstalled the OS and still waiting for a CPU and >>memory upgrade. >> >> > >Do you have another place to store the database in the meantime -- an >Intel box with a cheap disk, or anything? At least you'd have >another copy of the database somewhere that way. > >A > > > Don't have a disk that can store my database... Still searching.... Cheers! Wim
On Tue, Sep 17, 2002 at 04:46:00PM +0200, Wim wrote: > > > >That won't help you if the controller is coming and going; you might > >find that it works one time, and not another. Indeed, a disk on its > >way out can even pass fsck sometimes, although it's pretty unusual. > > > The DB is located on a RAID5 disk array... Hmm. _That's_ interesting. I'd bet on a flakey controller, then. Is it hardware RAID? (I assume so.) A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Andrew Sullivan <andrew@libertyrms.info> writes: > On Tue, Sep 17, 2002 at 04:25:45PM +0200, Wim wrote: >> ERROR: AllocSetFree: cannot find block containing chunk 4c5ad0 >> >> Postgres is running on solaris 8... > Someone else suggested that this would not be the error when you have > written bad data to the disk (I thought you could have this if the > controller was flakey and wrote bad data in the past. Maybe I'm > wrong. Probably). Actually, what it looks like to me is a memory clobber; I don't think bad data on disk would be likely to lead to this particular type of failure. But writing one byte too many into a string, and thereby zeroing the high-order byte of an adjacent pointer, could lead to exactly this message when we later try to pfree() the pointer. I am wondering if Wim is running into that same Solaris snprintf() bug that we discovered awhile back --- it was not clear if the bug still exists in Solaris 8, but the symptoms sure match. See http://archives.postgresql.org/pgsql-bugs/2002-07/msg00059.php It would be useful to see a stack traceback from the point of the error, if possible. regards, tom lane
Andrew Sullivan wrote: >On Tue, Sep 17, 2002 at 04:46:00PM +0200, Wim wrote: > > >>>That won't help you if the controller is coming and going; you might >>>find that it works one time, and not another. Indeed, a disk on its >>>way out can even pass fsck sometimes, although it's pretty unusual. >>> >>> >>> >>The DB is located on a RAID5 disk array... >> >> > >Hmm. _That's_ interesting. I'd bet on a flakey controller, then. >Is it hardware RAID? (I assume so.) > >A > > > Yep, hardware RAID, infact it's a SUN T3 disk array with 9*36GB SCSI disks... Cheers! Wim
On Tue, Sep 17, 2002 at 10:55:18AM -0400, Tom Lane wrote: > > I am wondering if Wim is running into that same Solaris snprintf() bug > that we discovered awhile back --- it was not clear if the bug still > exists in Solaris 8, but the symptoms sure match. See > http://archives.postgresql.org/pgsql-bugs/2002-07/msg00059.php Hmm, good point. That was only a problem when compiled with the 64-bit libraries, IIRC. Wim, what does 'file postmaster' say? A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Tom Lane wrote: >Andrew Sullivan <andrew@libertyrms.info> writes: > > >>On Tue, Sep 17, 2002 at 04:25:45PM +0200, Wim wrote: >> >> >>>ERROR: AllocSetFree: cannot find block containing chunk 4c5ad0 >>> >>>Postgres is running on solaris 8... >>> >>> > > > >>Someone else suggested that this would not be the error when you have >>written bad data to the disk (I thought you could have this if the >>controller was flakey and wrote bad data in the past. Maybe I'm >>wrong. Probably). >> >> > >Actually, what it looks like to me is a memory clobber; I don't think >bad data on disk would be likely to lead to this particular type of >failure. But writing one byte too many into a string, and thereby >zeroing the high-order byte of an adjacent pointer, could lead to >exactly this message when we later try to pfree() the pointer. > >I am wondering if Wim is running into that same Solaris snprintf() bug >that we discovered awhile back --- it was not clear if the bug still >exists in Solaris 8, but the symptoms sure match. See >http://archives.postgresql.org/pgsql-bugs/2002-07/msg00059.php > >It would be useful to see a stack traceback from the point of the error, >if possible. > > regards, tom lane > I Would like to send a stack traceback, but I need some halp on this (never done this before). some add. info: SELECT relname FROM pg_class WHERE relname like 'pg_%' AND relkind = 'r'; gives: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. SELECT relname FROM pg_class; works well... SELECT relname, relkind from pg_class WHERE relkind='r'; works also... SELECT relname, relkind from pg_class WHERE relname like 'pg_%'; produces the same error as above... > >---------------------------(end of broadcast)--------------------------- >TIP 5: Have you checked our extensive FAQ? > >http://www.postgresql.org/users-lounge/docs/faq.html > > > > Cheers! Wim
Wim <wdh@belbone.be> writes: > Tom Lane wrote: >> It would be useful to see a stack traceback from the point of the error, >> if possible. > I Would like to send a stack traceback, but I need some halp on this > (never done this before). > some add. info: > SELECT relname FROM pg_class WHERE relname like 'pg_%' AND relkind = 'r'; > gives: > server closed the connection unexpectedly This should be producing a core file in your database directory ($PGDATA/base/yourdboid/). With gdb you'd do gdb /path/to/postgres-executable /path/to/corefile gdb> bt gdb> quit I don't remember the equivalent incantations with Solaris' debugger. regards, tom lane
On Tue, Sep 17, 2002 at 05:04:23PM +0200, Wim wrote: > > SELECT relname FROM pg_class WHERE relname like 'pg_%' AND relkind = 'r'; > gives: > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > > SELECT relname, relkind from pg_class WHERE relname like 'pg_%'; > produces the same error as above... That does rather suggest a memory clobber, as Tom suggested. Wim's 'file postmaster' shows it's a 32-bit binary, though, and I verified in my notes that the snprintf bug was only in the 64-bit library. A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Tue, Sep 17, 2002 at 11:08:46AM -0400, Tom Lane wrote: > This should be producing a core file in your database directory > ($PGDATA/base/yourdboid/). With gdb you'd do > gdb /path/to/postgres-executable /path/to/corefile > gdb> bt > gdb> quit > I don't remember the equivalent incantations with Solaris' debugger. I think it's adb /path/to/postgres-executable /path/to/corefile $c [or] $C A -- ---- Andrew Sullivan 204-4141 Yonge Street Liberty RMS Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
gdb gives me this: bash-2.05$ adb /usr/local/pgsql/bin/postgres /data/postgres/base/17903709/core core file = /data/postgres/base/17903709/core -- program ``/usr/local/pgsql/bin/postgres'' on platform SUNW,Ultra-60 SIGBUS: Bus Error $c AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13) MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00) MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541) make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1) prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, ffbfe6a0) patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728) likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809) OidFunctionCall4+0x124(71b, 4c0780, 4b7, 4c7b90, 1, ffbfe7e1) restriction_selectivity+0x64(4c0780, 4b7, 4c7b90, 1, 0, ffbfe8a0) clauselist_selectivity+0x164(4c0780, 4c8478, 1, 0, 0, 4c8295) restrictlist_selectivity+0x2c(4c0780, 4c7c18, 1, 0, 0, ff0000) set_baserel_size_estimates+0x2c(4c0780, 4c7d70, ffffffff, fffffff8, 0, 4c83f9) set_plain_rel_pathlist+0x18(4c0780, 4c7d70, 4c0808, 53, 4c0348, 20) set_base_rel_pathlists+0xf8(4c0780, 4c8460, 1, 0, 0, 4c7c00) make_one_rel+0xc(4c0780, 0, ffbfecb8, ffbfecb0, 0, 0) subplanner+0x148(4c0780, 4c7cb8, 0, 0, 0, 0) query_planner+0x98(4c0780, 4c7a48, 0, 0, 0, 0) grouping_planner+0x7cc(4c0780, bff00000, 0, ff13a000, 0, 0) subquery_planner+0x260(4c0780, bff00000, 0, 7efefeff, 81010100, ff0000) planner+0x54(4c0780, 4c1e00, 4c15e0, fffffff8, 0, ffbfffd5) pg_plan_query+0x54(4c0780, 29b99c, 0, 53, 4c0348, 20) pg_exec_query_string+0x388(4c0348, 2, 476010, 4c0330, 800000, 0) PostgresMain+0x1398(5, ffbff2d0, 40c1d9, 473, 0, ffbff1c8) DoBackend+0x7d8(40c0a8, 1, 22730, 1552dc, 0, 40c2d9) BackendStartup+0xb0(40c0a8, 5, ffbff800, ffbff5d8, ffbff658, 0) ServerLoop+0x370(297024, 49a0, 0, 3f1f88, 297004, 2d560000) PostmasterMain+0xbe4(5, 3f2980, 65720000, 0, 65720000, 65720000) main+0x294(5, ffbffd8c, ffbffda4, 3e64c0, 0, 0) _start+0x5c(0, 0, 0, 0, 0, 0) $C ffbfe340 AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13) ffbfe3e0 MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00) ffbfe450 MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541) ffbfe4c8 make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1) ffbfe548 prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, ffbfe6a0) ffbfe5d8 patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728) ffbfe6b0 likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809) ffbfe728 OidFunctionCall4+0x124(71b, 4c0780, 4b7, 4c7b90, 1, ffbfe7e1) ffbfe828 restriction_selectivity+0x64(4c0780, 4b7, 4c7b90, 1, 0, ffbfe8a0) ffbfe8b0 clauselist_selectivity+0x164(4c0780, 4c8478, 1, 0, 0, 4c8295) ffbfe958 restrictlist_selectivity+0x2c(4c0780, 4c7c18, 1, 0, 0, ff0000) ffbfe9d8 set_baserel_size_estimates+0x2c(4c0780, 4c7d70, ffffffff, fffffff8, 0, 4c83f9) ffbfea48 set_plain_rel_pathlist+0x18(4c0780, 4c7d70, 4c0808, 53, 4c0348, 20) ffbfeab8 set_base_rel_pathlists+0xf8(4c0780, 4c8460, 1, 0, 0, 4c7c00) ffbfeb40 make_one_rel+0xc(4c0780, 0, ffbfecb8, ffbfecb0, 0, 0) ffbfebb8 subplanner+0x148(4c0780, 4c7cb8, 0, 0, 0, 0) ffbfec70 query_planner+0x98(4c0780, 4c7a48, 0, 0, 0, 0) ffbfecf8 grouping_planner+0x7cc(4c0780, bff00000, 0, ff13a000, 0, 0) ffbfedc8 subquery_planner+0x260(4c0780, bff00000, 0, 7efefeff, 81010100, ff0000) ffbfee58 planner+0x54(4c0780, 4c1e00, 4c15e0, fffffff8, 0, ffbfffd5) ffbfeed8 pg_plan_query+0x54(4c0780, 29b99c, 0, 53, 4c0348, 20) ffbfef50 pg_exec_query_string+0x388(4c0348, 2, 476010, 4c0330, 800000, 0) ffbff038 PostgresMain+0x1398(5, ffbff2d0, 40c1d9, 473, 0, ffbff1c8) ffbff0f8 DoBackend+0x7d8(40c0a8, 1, 22730, 1552dc, 0, 40c2d9) ffbff4e8 BackendStartup+0xb0(40c0a8, 5, ffbff800, ffbff5d8, ffbff658, 0) ffbff568 ServerLoop+0x370(297024, 49a0, 0, 3f1f88, 297004, 2d560000) ffbff810 PostmasterMain+0xbe4(5, 3f2980, 65720000, 0, 65720000, 65720000) ffbffca0 main+0x294(5, ffbffd8c, ffbffda4, 3e64c0, 0, 0) ffbffd28 _start+0x5c(0, 0, 0, 0, 0, 0) Andrew Sullivan wrote: >On Tue, Sep 17, 2002 at 11:08:46AM -0400, Tom Lane wrote: > > >>This should be producing a core file in your database directory >>($PGDATA/base/yourdboid/). With gdb you'd do >> gdb /path/to/postgres-executable /path/to/corefile >> gdb> bt >> gdb> quit >>I don't remember the equivalent incantations with Solaris' debugger. >> > >I think it's > >adb /path/to/postgres-executable /path/to/corefile >$c > >[or] > >$C > >A > >
Wim <wdh@belbone.be> writes: > gdb gives me this: > bash-2.05$ adb /usr/local/pgsql/bin/postgres > /data/postgres/base/17903709/core > core file = /data/postgres/base/17903709/core -- program > ``/usr/local/pgsql/bin/postgres'' on platform SUNW,Ultra-60 > SIGBUS: Bus Error > $c > AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13) > MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00) > MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541) > make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1) > prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, > ffbfe6a0) > patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728) > likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809) Hm. Are you running in a multibyte character encoding? I had a note that make_greater_string may have problems in the MULTIBYTE case. regards, tom lane
Tom Lane wrote: >Wim <wdh@belbone.be> writes: > > >>gdb gives me this: >>bash-2.05$ adb /usr/local/pgsql/bin/postgres >>/data/postgres/base/17903709/core >>core file = /data/postgres/base/17903709/core -- program >>``/usr/local/pgsql/bin/postgres'' on platform SUNW,Ultra-60 >>SIGBUS: Bus Error >>$c >>AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13) >>MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00) >>MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541) >>make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1) >>prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, >>ffbfe6a0) >>patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728) >>likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809) >> >> > >Hm. Are you running in a multibyte character encoding? I had a note >that make_greater_string may have problems in the MULTIBYTE case. > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > > Yes, I compiled postgres with multibyte and ODBC support . Is there a workaround possibel to fix my problem? Cheers! Wim
Wim <wdh@belbone.be> writes: >> Hm. Are you running in a multibyte character encoding? I had a note >> that make_greater_string may have problems in the MULTIBYTE case. > Yes, I compiled postgres with multibyte and ODBC support . But are you actually *using* the multibyte code? What does "psql -l" show as the encoding for your database? regards, tom lane
Tom Lane wrote: >Wim <wdh@belbone.be> writes: > >>>Hm. Are you running in a multibyte character encoding? I had a note >>>that make_greater_string may have problems in the MULTIBYTE case. >>> > >>Yes, I compiled postgres with multibyte and ODBC support . >> > >But are you actually *using* the multibyte code? What does "psql -l" >show as the encoding for your database? > > regards, tom lane > >---------------------------(end of broadcast)--------------------------- >TIP 5: Have you checked our extensive FAQ? > >http://www.postgresql.org/users-lounge/docs/faq.html > > > psql -l shows: List of databases Name | Owner | Encoding ---------------+----------+----------- addressIP | postgres | SQL_ASCII belbonedb_v2 | postgres | SQL_ASCII belbonedb_v21 | postgres | SQL_ASCII peering | postgres | SQL_ASCII peering_v2 | postgres | SQL_ASCII postgres | postgres | SQL_ASCII smsbilling | postgres | SQL_ASCII template0 | postgres | SQL_ASCII template1 | postgres | SQL_ASCII (9 rows) Maybe I should recompile Postgres without multibyte support... Cheers! Wim