Thread: Re: [GENERAL] Still big problems with pg_dump!

Re: [GENERAL] Still big problems with pg_dump!

From
Andrew Sullivan
Date:
-hackers removed.

On Tue, Sep 17, 2002 at 10:11:41AM +0200, Wim wrote:

> ERROR:  AllocSetFree: cannot find block containing chunk 4c5ad0

This is definitely some sort of disk problem.  Either you've written
bad data to the disk in some way, or else the disk is corrupted or
damaged.

If it is a hardware problem, the obvious suspects are memory (I'd
discount this idea unless everything else checks out), a disk
failure, or a controller failure.

It could be OS related as well.  Several kernels in the 2.4 Linux
series, for instance, had problems with massive filesystem corruption.

> Some people suggest a drive failure, but I checked that and found no
> problems...

How did you check?

> I must say that one of the tables contains more than 3.000.000 rows,
> another more than 1.400.000...

When is your most recent backup?  If you can't pg_dump, you will be
needing that backup.

> I must say that I had this problem a few months ago. I got some help
> then, but that didn't solve my problem, so
> I recreated the database from scratch and copied the data, to fix things
> quickly. Things went well for about two months :-(

So you re-installed the data set on a machine that had somehow
failed, you don't know why, and hoped that the problem would
solve itself?  Uh, that wasn't a good idea.  In the future, if you
have a problem which people suggest might be, for instance, a bad
disk, it'd be a _very good_ idea to figure out precisely what the
problem is before relying on the identical hardware again.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: [GENERAL] Still big problems with pg_dump!

From
Andrew Sullivan
Date:
On Tue, Sep 17, 2002 at 04:25:45PM +0200, Wim wrote:
> >
> >>ERROR:  AllocSetFree: cannot find block containing chunk 4c5ad0
> >>
> >>
> >
> >This is definitely some sort of disk problem.  Either you've written
> >bad data to the disk in some way, or else the disk is corrupted or
> >damaged.
> >
> Postgres is running on Solaris 8...
> It is the same database as the previous time that has the problem, but not
> the same table.

Someone else suggested that this would not be the error when you have
written bad data to the disk (I thought you could have this if the
controller was flakey and wrote bad data in the past.  Maybe I'm
wrong.  Probably).

> >How did you check?
> >
> >
> with fsck.

That won't help you if the controller is coming and going; you might
find that it works one time, and not another.  Indeed, a disk on its
way out can even pass fsck sometimes, although it's pretty unusual.

> Have backup... I can still SQL COPY to a text file, so that's no problem
> so far.

Well, that's good.  I'd suggest backing up _really often_ until you
know what the problem is, especially since this is production.

> I know, I don't have much spare hardware, and the database had to work
> quickly, it was the
> only solution then.
> Checked the disk, reinstalled the OS and still waiting for a CPU and
> memory upgrade.

Do you have another place to store the database in the meantime -- an
Intel box with a cheap disk, or anything?  At least you'd have
another copy of the database somewhere that way.

A



Re: [GENERAL] Still big problems with pg_dump!

From
Wim
Date:

Andrew Sullivan wrote:

<snip>...

>>>How did you check?
>>>
>>>
>>>
>>>
>>with fsck.
>>
>>
>
>That won't help you if the controller is coming and going; you might
>find that it works one time, and not another.  Indeed, a disk on its
>way out can even pass fsck sometimes, although it's pretty unusual.
>
The DB is located on a RAID5 disk array...

>
>
>>Have backup... I can still SQL COPY to a text file, so that's no problem
>>so far.
>>
>>
>
>Well, that's good.  I'd suggest backing up _really often_ until you
>know what the problem is, especially since this is production.
>
>
>
>>I know, I don't have much spare hardware, and the database had to work
>>quickly, it was the
>>only solution then.
>>Checked the disk, reinstalled the OS and still waiting for a CPU and
>>memory upgrade.
>>
>>
>
>Do you have another place to store the database in the meantime -- an
>Intel box with a cheap disk, or anything?  At least you'd have
>another copy of the database somewhere that way.
>
>A
>
>
>
Don't have a disk that can store my database...


Still searching....


Cheers!

Wim


Re: [GENERAL] Still big problems with pg_dump!

From
Andrew Sullivan
Date:
On Tue, Sep 17, 2002 at 04:46:00PM +0200, Wim wrote:
> >
> >That won't help you if the controller is coming and going; you might
> >find that it works one time, and not another.  Indeed, a disk on its
> >way out can even pass fsck sometimes, although it's pretty unusual.
> >
> The DB is located on a RAID5 disk array...

Hmm.  _That's_ interesting.  I'd bet on a flakey controller, then.
Is it hardware RAID?  (I assume so.)

A



Re: [GENERAL] Still big problems with pg_dump!

From
Tom Lane
Date:
Andrew Sullivan <andrew@libertyrms.info> writes:
> On Tue, Sep 17, 2002 at 04:25:45PM +0200, Wim wrote:
>> ERROR:  AllocSetFree: cannot find block containing chunk 4c5ad0
>>
>> Postgres is running on solaris 8...

> Someone else suggested that this would not be the error when you have
> written bad data to the disk (I thought you could have this if the
> controller was flakey and wrote bad data in the past.  Maybe I'm
> wrong.  Probably).

Actually, what it looks like to me is a memory clobber; I don't think
bad data on disk would be likely to lead to this particular type of
failure.  But writing one byte too many into a string, and thereby
zeroing the high-order byte of an adjacent pointer, could lead to
exactly this message when we later try to pfree() the pointer.

I am wondering if Wim is running into that same Solaris snprintf() bug
that we discovered awhile back --- it was not clear if the bug still
exists in Solaris 8, but the symptoms sure match.  See
http://archives.postgresql.org/pgsql-bugs/2002-07/msg00059.php

It would be useful to see a stack traceback from the point of the error,
if possible.

            regards, tom lane
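
The memory clobber described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it models memory as a byte array, assumes a big-endian layout (as on SPARC) with a 4-byte string buffer directly followed by a 4-byte pointer, and uses an invented pointer value. The names `clobber`, `BUF_SIZE`, and `POINTER` are hypothetical and do not come from PostgreSQL.

```python
import struct

BUF_SIZE = 4            # assumed size of the string buffer
POINTER = 0x12345678    # invented pointer value, for illustration only


def clobber(text: bytes) -> int:
    """Copy text plus its NUL terminator into the buffer without a bounds
    check, then return the value of the pointer stored right after it."""
    # Big-endian layout: the high-order byte of the pointer comes first,
    # immediately after the last byte of the buffer.
    memory = bytearray(BUF_SIZE) + bytearray(struct.pack(">I", POINTER))
    data = text + b"\x00"            # C strings carry a trailing NUL
    memory[0:len(data)] = data       # unchecked copy, like a buggy sprintf
    (pointer,) = struct.unpack_from(">I", memory, BUF_SIZE)
    return pointer


print(hex(clobber(b"abc")))    # fits: pointer is untouched
print(hex(clobber(b"abcd")))   # one byte too many: high-order byte zeroed
```

With a three-character string the terminator still fits and the pointer survives; with four characters the terminator lands on the pointer's high-order byte, so a later attempt to free that pointer would be handed an address that no longer belongs to any allocated block.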

Re: [GENERAL] Still big problems with pg_dump!

From
Wim
Date:

Andrew Sullivan wrote:

>On Tue, Sep 17, 2002 at 04:46:00PM +0200, Wim wrote:
>
>
>>>That won't help you if the controller is coming and going; you might
>>>find that it works one time, and not another.  Indeed, a disk on its
>>>way out can even pass fsck sometimes, although it's pretty unusual.
>>>
>>>
>>>
>>The DB is located on a RAID5 disk array...
>>
>>
>
>Hmm.  _That's_ interesting.  I'd bet on a flakey controller, then.
>Is it hardware RAID?  (I assume so.)
>
>A
>
>
>
Yep, hardware RAID; in fact it's a Sun T3 disk array with 9 x 36 GB SCSI
disks...

Cheers!

Wim


Re: [GENERAL] Still big problems with pg_dump!

From
Andrew Sullivan
Date:
On Tue, Sep 17, 2002 at 10:55:18AM -0400, Tom Lane wrote:
>
> I am wondering if Wim is running into that same Solaris snprintf() bug
> that we discovered awhile back --- it was not clear if the bug still
> exists in Solaris 8, but the symptoms sure match.  See
> http://archives.postgresql.org/pgsql-bugs/2002-07/msg00059.php

Hmm, good point.  That was only a problem when compiled with the
64-bit libraries, IIRC.  Wim, what does 'file postmaster' say?

A



Re: [GENERAL] Still big problems with pg_dump!

From
Wim
Date:

Tom Lane wrote:

>Andrew Sullivan <andrew@libertyrms.info> writes:
>
>
>>On Tue, Sep 17, 2002 at 04:25:45PM +0200, Wim wrote:
>>
>>
>>>ERROR:  AllocSetFree: cannot find block containing chunk 4c5ad0
>>>
>>>Postgres is running on solaris 8...
>>>
>>>
>
>
>
>>Someone else suggested that this would not be the error when you have
>>written bad data to the disk (I thought you could have this if the
>>controller was flakey and wrote bad data in the past.  Maybe I'm
>>wrong.  Probably).
>>
>>
>
>Actually, what it looks like to me is a memory clobber; I don't think
>bad data on disk would be likely to lead to this particular type of
>failure.  But writing one byte too many into a string, and thereby
>zeroing the high-order byte of an adjacent pointer, could lead to
>exactly this message when we later try to pfree() the pointer.
>
>I am wondering if Wim is running into that same Solaris snprintf() bug
>that we discovered awhile back --- it was not clear if the bug still
>exists in Solaris 8, but the symptoms sure match.  See
>http://archives.postgresql.org/pgsql-bugs/2002-07/msg00059.php
>
>It would be useful to see a stack traceback from the point of the error,
>if possible.
>
>            regards, tom lane
>
I would like to send a stack traceback, but I need some help with this
(never done this before).

Some additional info:

SELECT relname FROM pg_class WHERE relname like 'pg_%' AND relkind = 'r';
gives:
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

SELECT relname FROM pg_class;
works well...

SELECT relname, relkind from pg_class WHERE relkind='r';
works also...

SELECT relname, relkind from pg_class WHERE relname like 'pg_%';
produces the same error as above...

Cheers!

Wim


Re: [GENERAL] Still big problems with pg_dump!

From
Tom Lane
Date:
Wim <wdh@belbone.be> writes:
> Tom Lane wrote:
>> It would be useful to see a stack traceback from the point of the error,
>> if possible.

> I would like to send a stack traceback, but I need some help with this
> (never done this before).

> Some additional info:

> SELECT relname FROM pg_class WHERE relname like 'pg_%' AND relkind = 'r';
> gives:
> server closed the connection unexpectedly

This should be producing a core file in your database directory
($PGDATA/base/yourdboid/).  With gdb you'd do
    gdb /path/to/postgres-executable /path/to/corefile
    gdb> bt
    gdb> quit
I don't remember the equivalent incantations with Solaris' debugger.

            regards, tom lane
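
If gdb is available, the same backtrace can be captured without an interactive session. The following is a minimal sketch, assuming a gdb that supports the `-batch` and `-ex` options; the helper names `backtrace_command` and `print_backtrace` are hypothetical, and the paths are placeholders exactly as in the message above.

```python
import shutil
import subprocess


def backtrace_command(executable: str, corefile: str) -> list:
    """Build a non-interactive gdb invocation that runs 'bt' and exits."""
    return ["gdb", "-batch", "-ex", "bt", executable, corefile]


def print_backtrace(executable: str, corefile: str) -> str:
    """Run gdb on the core file and return whatever it printed."""
    if shutil.which("gdb") is None:
        raise RuntimeError("gdb not found on PATH")
    result = subprocess.run(
        backtrace_command(executable, corefile),
        capture_output=True,
        text=True,
        check=False,  # gdb may exit non-zero on a damaged core; keep output
    )
    return result.stdout


# Placeholder paths; substitute the real executable and core file.
print(" ".join(backtrace_command("/path/to/postgres-executable",
                                 "/path/to/corefile")))
```

The output of `print_backtrace` can then be pasted into a mail as-is.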

Re: [GENERAL] Still big problems with pg_dump!

From
Andrew Sullivan
Date:
On Tue, Sep 17, 2002 at 05:04:23PM +0200, Wim wrote:
>
> SELECT relname FROM pg_class WHERE relname like 'pg_%' AND relkind = 'r';
> gives:
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> SELECT relname, relkind from pg_class WHERE relname like 'pg_%';
> produces the same error as above...

That does rather suggest a memory clobber, as Tom suggested.  Wim's
'file postmaster' shows it's a 32-bit binary, though, and I verified
in my notes that the snprintf bug was only in the 64-bit library.

A



Re: [GENERAL] Still big problems with pg_dump!

From
Andrew Sullivan
Date:
On Tue, Sep 17, 2002 at 11:08:46AM -0400, Tom Lane wrote:

> This should be producing a core file in your database directory
> ($PGDATA/base/yourdboid/).  With gdb you'd do
>     gdb /path/to/postgres-executable /path/to/corefile
>     gdb> bt
>     gdb> quit
> I don't remember the equivalent incantations with Solaris' debugger.

I think it's

adb /path/to/postgres-executable /path/to/corefile
$c

[or]

$C

A



Re: [GENERAL] Still big problems with pg_dump!

From
Wim
Date:
adb gives me this:

bash-2.05$ adb /usr/local/pgsql/bin/postgres /data/postgres/base/17903709/core
core file = /data/postgres/base/17903709/core -- program ``/usr/local/pgsql/bin/postgres'' on platform SUNW,Ultra-60
SIGBUS: Bus Error
$c
AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13)
MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00)
MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541)
make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1)
prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, ffbfe6a0)
patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728)
likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809)
OidFunctionCall4+0x124(71b, 4c0780, 4b7, 4c7b90, 1, ffbfe7e1)
restriction_selectivity+0x64(4c0780, 4b7, 4c7b90, 1, 0, ffbfe8a0)
clauselist_selectivity+0x164(4c0780, 4c8478, 1, 0, 0, 4c8295)
restrictlist_selectivity+0x2c(4c0780, 4c7c18, 1, 0, 0, ff0000)
set_baserel_size_estimates+0x2c(4c0780, 4c7d70, ffffffff, fffffff8, 0, 4c83f9)
set_plain_rel_pathlist+0x18(4c0780, 4c7d70, 4c0808, 53, 4c0348, 20)
set_base_rel_pathlists+0xf8(4c0780, 4c8460, 1, 0, 0, 4c7c00)
make_one_rel+0xc(4c0780, 0, ffbfecb8, ffbfecb0, 0, 0)
subplanner+0x148(4c0780, 4c7cb8, 0, 0, 0, 0)
query_planner+0x98(4c0780, 4c7a48, 0, 0, 0, 0)
grouping_planner+0x7cc(4c0780, bff00000, 0, ff13a000, 0, 0)
subquery_planner+0x260(4c0780, bff00000, 0, 7efefeff, 81010100, ff0000)
planner+0x54(4c0780, 4c1e00, 4c15e0, fffffff8, 0, ffbfffd5)
pg_plan_query+0x54(4c0780, 29b99c, 0, 53, 4c0348, 20)
pg_exec_query_string+0x388(4c0348, 2, 476010, 4c0330, 800000, 0)
PostgresMain+0x1398(5, ffbff2d0, 40c1d9, 473, 0, ffbff1c8)
DoBackend+0x7d8(40c0a8, 1, 22730, 1552dc, 0, 40c2d9)
BackendStartup+0xb0(40c0a8, 5, ffbff800, ffbff5d8, ffbff658, 0)
ServerLoop+0x370(297024, 49a0, 0, 3f1f88, 297004, 2d560000)
PostmasterMain+0xbe4(5, 3f2980, 65720000, 0, 65720000, 65720000)
main+0x294(5, ffbffd8c, ffbffda4, 3e64c0, 0, 0)
_start+0x5c(0, 0, 0, 0, 0, 0)
$C
ffbfe340 AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13)
ffbfe3e0 MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00)
ffbfe450 MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541)
ffbfe4c8 make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1)
ffbfe548 prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, ffbfe6a0)
ffbfe5d8 patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728)
ffbfe6b0 likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809)
ffbfe728 OidFunctionCall4+0x124(71b, 4c0780, 4b7, 4c7b90, 1, ffbfe7e1)
ffbfe828 restriction_selectivity+0x64(4c0780, 4b7, 4c7b90, 1, 0, ffbfe8a0)
ffbfe8b0 clauselist_selectivity+0x164(4c0780, 4c8478, 1, 0, 0, 4c8295)
ffbfe958 restrictlist_selectivity+0x2c(4c0780, 4c7c18, 1, 0, 0, ff0000)
ffbfe9d8 set_baserel_size_estimates+0x2c(4c0780, 4c7d70, ffffffff, fffffff8, 0, 4c83f9)
ffbfea48 set_plain_rel_pathlist+0x18(4c0780, 4c7d70, 4c0808, 53, 4c0348, 20)
ffbfeab8 set_base_rel_pathlists+0xf8(4c0780, 4c8460, 1, 0, 0, 4c7c00)
ffbfeb40 make_one_rel+0xc(4c0780, 0, ffbfecb8, ffbfecb0, 0, 0)
ffbfebb8 subplanner+0x148(4c0780, 4c7cb8, 0, 0, 0, 0)
ffbfec70 query_planner+0x98(4c0780, 4c7a48, 0, 0, 0, 0)
ffbfecf8 grouping_planner+0x7cc(4c0780, bff00000, 0, ff13a000, 0, 0)
ffbfedc8 subquery_planner+0x260(4c0780, bff00000, 0, 7efefeff, 81010100, ff0000)
ffbfee58 planner+0x54(4c0780, 4c1e00, 4c15e0, fffffff8, 0, ffbfffd5)
ffbfeed8 pg_plan_query+0x54(4c0780, 29b99c, 0, 53, 4c0348, 20)
ffbfef50 pg_exec_query_string+0x388(4c0348, 2, 476010, 4c0330, 800000, 0)
ffbff038 PostgresMain+0x1398(5, ffbff2d0, 40c1d9, 473, 0, ffbff1c8)
ffbff0f8 DoBackend+0x7d8(40c0a8, 1, 22730, 1552dc, 0, 40c2d9)
ffbff4e8 BackendStartup+0xb0(40c0a8, 5, ffbff800, ffbff5d8, ffbff658, 0)
ffbff568 ServerLoop+0x370(297024, 49a0, 0, 3f1f88, 297004, 2d560000)
ffbff810 PostmasterMain+0xbe4(5, 3f2980, 65720000, 0, 65720000, 65720000)
ffbffca0 main+0x294(5, ffbffd8c, ffbffda4, 3e64c0, 0, 0)
ffbffd28 _start+0x5c(0, 0, 0, 0, 0, 0)



Andrew Sullivan wrote:

>On Tue, Sep 17, 2002 at 11:08:46AM -0400, Tom Lane wrote:
>
>
>>This should be producing a core file in your database directory
>>($PGDATA/base/yourdboid/).  With gdb you'd do
>>    gdb /path/to/postgres-executable /path/to/corefile
>>    gdb> bt
>>    gdb> quit
>>I don't remember the equivalent incantations with Solaris' debugger.
>>
>
>I think it's
>
>adb /path/to/postgres-executable /path/to/corefile
>$c
>
>[or]
>
>$C
>
>A
>
>



Re: [GENERAL] Still big problems with pg_dump!

From
Tom Lane
Date:
Wim <wdh@belbone.be> writes:
> adb gives me this:
> bash-2.05$ adb /usr/local/pgsql/bin/postgres /data/postgres/base/17903709/core
> core file = /data/postgres/base/17903709/core -- program ``/usr/local/pgsql/bin/postgres'' on platform SUNW,Ultra-60
> SIGBUS: Bus Error
> $c
> AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13)
> MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00)
> MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541)
> make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1)
> prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, ffbfe6a0)
> patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728)
> likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809)

Hm.  Are you running in a multibyte character encoding?  I had a note
that make_greater_string may have problems in the MULTIBYTE case.

            regards, tom lane
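
For context on why a LIKE query reaches make_greater_string at all: judging from the trace, the planner calls it from prefix_selectivity while estimating how many rows a prefix pattern will match. The idea, as far as I understand it, is to find a string that sorts just after every string starting with the pattern's fixed prefix, so the prefix match can be treated as a range. The sketch below shows only that idea on plain bytes; it assumes simple byte-wise ordering, ignores locale and multibyte handling entirely, and `make_greater_bytes` is a hypothetical name, not PostgreSQL's code.

```python
from typing import Optional


def make_greater_bytes(prefix: bytes) -> Optional[bytes]:
    """Return a byte string greater than every string starting with prefix,
    or None if no such string exists (prefix is empty or all 0xFF)."""
    work = bytearray(prefix)
    while work:
        if work[-1] < 0xFF:
            work[-1] += 1        # bump the last byte to get an upper bound
            return bytes(work)
        work.pop()               # last byte cannot grow; drop it and retry
    return None


# In LIKE, '_' is a single-character wildcard, so 'pg_%' has the fixed
# prefix 'pg' and the matching range would be 'pg' <= relname < 'ph'.
print(make_greater_bytes(b"pg"))
```

In a multibyte encoding the "drop the last byte" step has to remove a whole character instead, which is the part Tom's note refers to. A two-character prefix would also fit the 3-byte allocation request visible in the trace (two characters plus a terminator), though that reading is an inference rather than something the thread confirms.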

Re: [GENERAL] Still big problems with pg_dump!

From
Wim
Date:

Tom Lane wrote:

>Wim <wdh@belbone.be> writes:
>
>
>>adb gives me this:
>>bash-2.05$ adb /usr/local/pgsql/bin/postgres /data/postgres/base/17903709/core
>>core file = /data/postgres/base/17903709/core -- program ``/usr/local/pgsql/bin/postgres'' on platform SUNW,Ultra-60
>>SIGBUS: Bus Error
>>$c
>>AllocSetAlloc+0x18c(476120, 3, 226018, 0, 0, 13)
>>MemoryContextAlloc+0x68(476120, 3, 70670000, 7efefeff, 81010100, ff00)
>>MemoryContextStrdup+0x28(476120, 4c8568, ffffffff, fffffff8, 0, ffbfe541)
>>make_greater_string+0x1c(4c8568, 13, 297, 4c8730, 0, ffbfe5d1)
>>prefix_selectivity+0xcc(4c0780, 4c7ab8, 4c8568, ffbfe670, ffbfe68b, ffbfe6a0)
>>patternsel+0x278(ffbfe7a0, 0, 476120, 0, 0, ffbfe728)
>>likesel+0x10(ffbfe7a0, ffbfe7a0, 1d7080, fffffff8, 0, ffbfe809)
>>
>>
>
>Hm.  Are you running in a multibyte character encoding?  I had a note
>that make_greater_string may have problems in the MULTIBYTE case.
>
>            regards, tom lane
>
Yes, I compiled Postgres with multibyte and ODBC support. Is there a
possible workaround to fix my problem?

Cheers!

Wim


Re: [GENERAL] Still big problems with pg_dump!

From
Tom Lane
Date:
Wim <wdh@belbone.be> writes:
>> Hm.  Are you running in a multibyte character encoding?  I had a note
>> that make_greater_string may have problems in the MULTIBYTE case.

> Yes, I compiled Postgres with multibyte and ODBC support.

But are you actually *using* the multibyte code?  What does "psql -l"
show as the encoding for your database?

            regards, tom lane

Re: [GENERAL] Still big problems with pg_dump!

From
Wim
Date:

Tom Lane wrote:

>Wim <wdh@belbone.be> writes:
>
>>>Hm.  Are you running in a multibyte character encoding?  I had a note
>>>that make_greater_string may have problems in the MULTIBYTE case.
>>>
>
>>Yes, I compiled Postgres with multibyte and ODBC support.
>>
>
>But are you actually *using* the multibyte code?  What does "psql -l"
>show as the encoding for your database?
>
>            regards, tom lane
>
psql -l shows:

          List of databases
     Name      |  Owner   | Encoding
---------------+----------+-----------
 addressIP     | postgres | SQL_ASCII
 belbonedb_v2  | postgres | SQL_ASCII
 belbonedb_v21 | postgres | SQL_ASCII
 peering       | postgres | SQL_ASCII
 peering_v2    | postgres | SQL_ASCII
 postgres      | postgres | SQL_ASCII
 smsbilling    | postgres | SQL_ASCII
 template0     | postgres | SQL_ASCII
 template1     | postgres | SQL_ASCII
(9 rows)

Maybe I should recompile Postgres without multibyte support...


Cheers!

Wim