Thread: Serious Crash last Friday

Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hello all,

on Friday we experienced a very very worrying crash of our postgresql
server.

Our system runs on a Intel Pentium IV, 1.6 GHz, 1 GB RAM with latest
Postgresql-Server
7.2.1 (redhat rpm) - on rather heavy load

There are 109 user tables in one database, the largest tables contain 60
columns and
approx. 400.000 rows. it's a dedicated database-only machine.

Well, the crash was indicated as follows: One of my employees complained
that she couldn't
work anymore (via webinterface). The error-message was due to an error in
the
employee-table. This particular table has a unique row for employee-numbers.
Suddenly
there were 11 entries for the same employee. Even my name was included
twice, and
another employee still working on friday afternoon was also included 3
times. Note:
This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.

Taking a closer look, I found additional tables, with non-unique values in
UNIQUE columns.

When trying to delete unique values by using the OIDs, I found out, that
even the OIDs
were the same!!!! Taking a yet closer look, I found out by querying
pg_tables that
there were duplicates of some tables. Then there was the message: "Backend
message type
0x44 arrived while idle"

I was running VACUUM and VACUUM FULL a hundred times - but it failed to
repair these
errors. It didn't even succeed in running VACUUM on all tables: VACUUM
complained something
about "UNIQUE" (I didn't write down the exact error message though).

Then I tried to DUMP as much as I could, then I stopped the database, moved
the db-folder to
a different location, did a new initdb and restored the whole system.
Unfortunately
there was one table I couldn't dump at all and I had to use the 15 hours old
backup copy.

But, please correct me if I am wrong, this should never actually happen,
shouldn't it?

Anyone had any of these problems before? I will see if this happens again -
and if it
does I will have to think about using a different backend-server. I'll don't
have to
explain to you, that a database server that corrupts data, is completely
useless.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
Martijn van Oosterhout
Date:
On Mon, Jun 17, 2002 at 08:43:37AM +0200, Henrik Steffen wrote:
>
> Hello all,
>
> on Friday we experienced a very very worrying crash of our postgresql
> server.

Sound like the CTIDs are out of whack or something. If you're really
desperate you can try the program here, it may be able to dump something.
http://svana.org/kleptog/pgsql/pgfsck.html

> Well, the crash was indicated as follows: One of my employees complained
> that she couldn't
> work anymore (via webinterface). The error-message was due to an error in
> the
> employee-table. This particular table has a unique row for employee-numbers.
> Suddenly
> there were 11 entries for the same employee. Even my name was included
> twice, and
> another employee still working on friday afternoon was also included 3
> times. Note:
> This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.

What DB version is this. Could it be XID wraparound?

> Taking a closer look, I found additional tables, with non-unique values in
> UNIQUE columns.
>
> When trying to delete unique values by using the OIDs, I found out, that
> even the OIDs
> were the same!!!! Taking a yet closer look, I found out by querying
> pg_tables that
> there were duplicates of some tables. Then there was the message: "Backend
> message type
> 0x44 arrived while idle"

Try the CTIDs, they will be unique.

> I was running VACUUM and VACUUM FULL a hundred times - but it failed to
> repair these
> errors. It didn't even succeed in running VACUUM on all tables: VACUUM
> complained something
> about "UNIQUE" (I didn't write down the exact error message though).

Please post the message exactly as printed out.

> Then I tried to DUMP as much as I could, then I stopped the database, moved
> the db-folder to
> a different location, did a new initdb and restored the whole system.
> Unfortunately
> there was one table I couldn't dump at all and I had to use the 15 hours old
> backup copy.
>
> But, please correct me if I am wrong, this should never actually happen,
> shouldn't it?

Never, that's why it would be helpful to know what went wrong.

> Anyone had any of these problems before? I will see if this happens again -
> and if it
> does I will have to think about using a different backend-server. I'll don't
> have to
> explain to you, that a database server that corrupts data, is completely
> useless.

HTH,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hi Martijn,

cute little program you pointed me to, thank you. So I am not the only
one expiriencing problems on certain SELECTs sometimes. That's another
very annoying thing about postgresql. Had it several times by now and
always tried to find the corrupted tuples by hand...

ok, but back to the crash of last friday:

> What DB version is this. Could it be XID wraparound?

it's postgres 7.2.1

what actually is XID wraparound, and how can I find out if I have it?

> Try the CTIDs, they will be unique.

ah, this was also new for me - have been working with oids sometimes,
but never heard of ctids before, thanks again.

> Please post the message exactly as printed out.

This is what I can see from /var/log/messages:
(these messages were often repeated:)

XLogFlush: request D/39CC9F8 is not satisfied - flushed only to D/39A4354

(some messages are in German, I'll try to translate them:)

Can't create "Unique"-Index, because table contains duplicated values

This happened while vacuuming:
Duplicated value cannot be inserted in "Unique"-Index pg_class_relname_index
Duplicated value cannot be inserted in "Unique"-Index
pg_statistic_relid_att_index
Duplicated value cannot be inserted in "Unique"-Index pg_class_oid_index

Looks like these system-tables have been corrupted, too

As i mentioned before, I copied the complete data-directory to a different
location, so
someone could have a look at the complete corrupted data.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 9:43 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Mon, Jun 17, 2002 at 08:43:37AM +0200, Henrik Steffen wrote:
> >
> > Hello all,
> >
> > on Friday we experienced a very very worrying crash of our postgresql
> > server.
>
> Sound like the CTIDs are out of whack or something. If you're really
> desperate you can try the program here, it may be able to dump something.
> http://svana.org/kleptog/pgsql/pgfsck.html
>
> > Well, the crash was indicated as follows: One of my employees complained
> > that she couldn't
> > work anymore (via webinterface). The error-message was due to an error
in
> > the
> > employee-table. This particular table has a unique row for
employee-numbers.
> > Suddenly
> > there were 11 entries for the same employee. Even my name was included
> > twice, and
> > another employee still working on friday afternoon was also included 3
> > times. Note:
> > This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.
>
> What DB version is this. Could it be XID wraparound?
>
> > Taking a closer look, I found additional tables, with non-unique values
in
> > UNIQUE columns.
> >
> > When trying to delete unique values by using the OIDs, I found out, that
> > even the OIDs
> > were the same!!!! Taking a yet closer look, I found out by querying
> > pg_tables that
> > there were duplicates of some tables. Then there was the message:
"Backend
> > message type
> > 0x44 arrived while idle"
>
> Try the CTIDs, they will be unique.
>
> > I was running VACUUM and VACUUM FULL a hundred times - but it failed to
> > repair these
> > errors. It didn't even succeed in running VACUUM on all tables: VACUUM
> > complained something
> > about "UNIQUE" (I didn't write down the exact error message though).
>
> Please post the message exactly as printed out.
>
> > Then I tried to DUMP as much as I could, then I stopped the database,
moved
> > the db-folder to
> > a different location, did a new initdb and restored the whole system.
> > Unfortunately
> > there was one table I couldn't dump at all and I had to use the 15 hours
old
> > backup copy.
> >
> > But, please correct me if I am wrong, this should never actually happen,
> > shouldn't it?
>
> Never, that's why it would be helpful to know what went wrong.
>
> > Anyone had any of these problems before? I will see if this happens
again -
> > and if it
> > does I will have to think about using a different backend-server. I'll
don't
> > have to
> > explain to you, that a database server that corrupts data, is completely
> > useless.
>
> HTH,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > There are 10 kinds of people in the world, those that can do binary
> > arithmetic and those that can't.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hello,

trying pgfsck on my corrupted employee table from friday it gave me about 85
lines complaining
about "Tuple incorrect length (parsed data=xxxxxx, length=xxx)"

the table had 184 rows, out of which 85 were corrupt ??


trying pgfsck on the current employee table of today (after new initdb etc.)
with 184 rows,
I get 814 (!!) rows complaining about "Tuple incorrect length ..." - how can
this be???


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 9:43 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Mon, Jun 17, 2002 at 08:43:37AM +0200, Henrik Steffen wrote:
> >
> > Hello all,
> >
> > on Friday we experienced a very very worrying crash of our postgresql
> > server.
>
> Sound like the CTIDs are out of whack or something. If you're really
> desperate you can try the program here, it may be able to dump something.
> http://svana.org/kleptog/pgsql/pgfsck.html
>
> > Well, the crash was indicated as follows: One of my employees complained
> > that she couldn't
> > work anymore (via webinterface). The error-message was due to an error
in
> > the
> > employee-table. This particular table has a unique row for employee-numb
ers.
> > Suddenly
> > there were 11 entries for the same employee. Even my name was included
> > twice, and
> > another employee still working on friday afternoon was also included 3
> > times. Note:
> > This was a table with a UNIQUE KEY - this shouldn't be possible IMHO.
>
> What DB version is this. Could it be XID wraparound?
>
> > Taking a closer look, I found additional tables, with non-unique values
in
> > UNIQUE columns.
> >
> > When trying to delete unique values by using the OIDs, I found out, that
> > even the OIDs
> > were the same!!!! Taking a yet closer look, I found out by querying
> > pg_tables that
> > there were duplicates of some tables. Then there was the message:
"Backend
> > message type
> > 0x44 arrived while idle"
>
> Try the CTIDs, they will be unique.
>
> > I was running VACUUM and VACUUM FULL a hundred times - but it failed to
> > repair these
> > errors. It didn't even succeed in running VACUUM on all tables: VACUUM
> > complained something
> > about "UNIQUE" (I didn't write down the exact error message though).
>
> Please post the message exactly as printed out.
>
> > Then I tried to DUMP as much as I could, then I stopped the database,
moved
> > the db-folder to
> > a different location, did a new initdb and restored the whole system.
> > Unfortunately
> > there was one table I couldn't dump at all and I had to use the 15 hours
old
> > backup copy.
> >
> > But, please correct me if I am wrong, this should never actually happen,
> > shouldn't it?
>
> Never, that's why it would be helpful to know what went wrong.
>
> > Anyone had any of these problems before? I will see if this happens
again -
> > and if it
> > does I will have to think about using a different backend-server. I'll
don't
> > have to
> > explain to you, that a database server that corrupts data, is completely
> > useless.
>
> HTH,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > There are 10 kinds of people in the world, those that can do binary
> > arithmetic and those that can't.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


Re: Serious Crash last Friday

From
Martijn van Oosterhout
Date:
On Mon, Jun 17, 2002 at 10:39:21AM +0200, Henrik Steffen wrote:
>
> Hello,
>
> trying pgfsck on my corrupted employee table from friday it gave me about 85
> lines complaining
> about "Tuple incorrect length (parsed data=xxxxxx, length=xxx)"
>
> the table had 184 rows, out of which 85 were corrupt ??

Hmm, it depends mostly on whether the parsed was greater or less than the
length. If it's less than it is something to do with additional padding
being added at the end, which I havn't worked out yet. I'm thinking of
supressing the warning where the remaining looks like padding.

If it's greater, there is a problem.

> trying pgfsck on the current employee table of today (after new initdb etc.)
> with 184 rows,
> I get 814 (!!) rows complaining about "Tuple incorrect length ..." - how can
> this be???

Same or different tables?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
same employee table (but after initdb, and restored from fridays dump)

how can i find out the correct length of the tuples?

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 11:27 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Mon, Jun 17, 2002 at 10:39:21AM +0200, Henrik Steffen wrote:
> >
> > Hello,
> >
> > trying pgfsck on my corrupted employee table from friday it gave me
about 85
> > lines complaining
> > about "Tuple incorrect length (parsed data=xxxxxx, length=xxx)"
> >
> > the table had 184 rows, out of which 85 were corrupt ??
>
> Hmm, it depends mostly on whether the parsed was greater or less than the
> length. If it's less than it is something to do with additional padding
> being added at the end, which I havn't worked out yet. I'm thinking of
> supressing the warning where the remaining looks like padding.
>
> If it's greater, there is a problem.
>
> > trying pgfsck on the current employee table of today (after new initdb
etc.)
> > with 184 rows,
> > I get 814 (!!) rows complaining about "Tuple incorrect length ..." - how
can
> > this be???
>
> Same or different tables?
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > There are 10 kinds of people in the world, those that can do binary
> > arithmetic and those that can't.


Re: Serious Crash last Friday

From
Martijn van Oosterhout
Date:
On Mon, Jun 17, 2002 at 11:38:00AM +0200, Henrik Steffen wrote:
> same employee table (but after initdb, and restored from fridays dump)
>
> how can i find out the correct length of the tuples?

Tuple incorrect length (parsed data=xxxxxx, length=xxx)

"parsed data" is what was processed
"length" is the amount allocated in the tuple header

so, "parsed data" <= "length" is good, though hopefully it's not too much
less.

HTH,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
> so, "parsed data" <= "length" is good, though hopefully it's not too much
> less.

well,

parsed data ==                  825242056
length ==                       84

I fear, it's toooooo much less.

BTW: The count of rows printed by the pgfsck increases permanently.
In my last mail, there were only 814 rows, 2 hours ago, there were 1200
rows, now there are 1600 rows !!! What does this mean? In the table, there
are only 185 rows like before.

There is always about 800000000 parsed data, and about 100-150 of length.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: <pgsql-general@postgresql.org>
Sent: Monday, June 17, 2002 12:02 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Mon, Jun 17, 2002 at 11:38:00AM +0200, Henrik Steffen wrote:
> > same employee table (but after initdb, and restored from fridays dump)
> >
> > how can i find out the correct length of the tuples?
>
> Tuple incorrect length (parsed data=xxxxxx, length=xxx)
>
> "parsed data" is what was processed
> "length" is the amount allocated in the tuple header
>
> so, "parsed data" <= "length" is good, though hopefully it's not too much
> less.
>
> HTH,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > There are 10 kinds of people in the world, those that can do binary
> > arithmetic and those that can't.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


Re: Serious Crash last Friday

From
Martijn van Oosterhout
Date:
On Mon, Jun 17, 2002 at 01:44:22PM +0200, Henrik Steffen wrote:
> > so, "parsed data" <= "length" is good, though hopefully it's not too much
> > less.
>
> well,
>
> parsed data ==                  825242056
> length ==                       84
>
> I fear, it's toooooo much less.

That's wierd. Something is really wierd here, as that number is 0x313031C8,
which has '101' in text format. What datatypes are you using. Seems more
like a bug in my program.

> BTW: The count of rows printed by the pgfsck increases permanently.
> In my last mail, there were only 814 rows, 2 hours ago, there were 1200
> rows, now there are 1600 rows !!! What does this mean? In the table, there
> are only 185 rows like before.
>
> There is always about 800000000 parsed data, and about 100-150 of length.

Are you using -a? It will print all rows, including deleted ones. That will
obviously continue to increase during usage. Are you doing this on a live
database?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
yes, I was doing it on a living database... not good?

I was not using -a


this is structure dump of the employee table:

/* --------------------------------------------------------
  phpPgAdmin 2.4-1 DB Dump
  http://www.sourceforge.net/projects/phppgadmin/
  Host: db.city-map.de:5432
  Database : "kunden"
  Table : "mitarbeiter"
  2002-06-17 16:06:15
-------------------------------------------------------- */

CREATE TABLE "mitarbeiter" (
   "miano" char(6),
   "name" text,
   "email" text,
   "titel" text,
   "telefon" text,
   "pos" int4,
   "benutzername" text,
   "lastlogindate" date,
   "lastlogintime" time,
   "sprache" int4,
   "prov1" int2,
   "prov2" int2,
   "region" char(4)
);
CREATE  UNIQUE INDEX "mitarbeiter_miano_idx" ON "mitarbeiter" ("miano");




Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
well, I don't work with transactions at all...

and I can't see no dupes...


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Martijn van Oosterhout" <kleptog@svana.org>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Tuesday, June 18, 2002 1:25 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Mon, Jun 17, 2002 at 05:28:23PM +0200, Henrik Steffen wrote:
> > yes, I was doing it on a living database... not good?
> >
> > I was not using -a
>
> It doesn't really matter if you're doing it on a live database, it just
> means that it won't see stuff that is in WAL that has not been committed
to
> the main storage.
>
> If you're not using -a then something else is going on. Do you have many
> uncomitted transactions? The program is not particularly intelligent about
> which rows are currently active, so you will probably see some duplicates.
> Look at the oids shown on the right of each tuple. Also, I've just
uploaded
> a new version which deals with dates and times much better.
>
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > There are 10 kinds of people in the world, those that can do binary
> > arithmetic and those that can't.


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE:  Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
 server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
 at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

When I noticed it this morning, I dropped all indexes and recreated them.
Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
properly again.

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.

But shouldn't this be unnecessary ?



Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Dear Philippe,

I have always been using the VACUUM ANALYZE on a daily basis.

I recently changed this cronjob into "VACUUM FULL ANALYZE" - which
didn't help either.

What seems to be helpful is, to drop all user indexes and recreate them
on a daily basis (before, I did this only on a weekly basis once every
sunday)

What exactly is XID wraparound? How can I make sure I am using it? Is it
anything
I shouldn't use? Or should I change anything to make my system run more
stable?

Any help highly appreciated

thanks


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Bertin, Philippe" <philippe.bertin@barco.com>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, June 20, 2002 9:06 AM
Subject: RE: [GENERAL] Serious Crash last Friday


Henrik,

I think it was Martijn who asked at a certain point if you made sure there
was no ... (XID ?) wraparound. As it's a heavily used database, couldn't
this be the cause ? i.e., did you try since then to regularly perform the
vacuum analyze (as I see this, on a daily basis) ?

Regards,

Philippe Bertin
Software Development Engineer Avionics
---------------------------------------------------

> -----Original Message-----
> From: Henrik Steffen [SMTP:steffen@city-map.de]
> Sent: donderdag 20 juni 2002 8:54
> To: pg
> Subject: Re: [GENERAL] Serious Crash last Friday
>
>
> Hello all,
>
> unfortunately I didn't get a really helping answer from the list yet.
>
> Additionally yesterday night there was again a problem with some SELECTs:
>
> NOTICE:  Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
> died abnormally and possibly corrupted shared memory.
> I have rolled back the current transaction and am
> going to terminate your database system connection and exit.
> Please reconnect to the database system and repeat your query.
> DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
>  server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> Command was:
> SELECT name
> FROM regionen
> WHERE region='0119';
>  at /web/pm/CityMap/Abfragen.pm line 135
>
> This is really annoying.
>
> When I noticed it this morning, I dropped all indexes and recreated them.
> Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query
> worked
> properly again.
>
> I have now created a cronjob that will drop and recreate all indexes on a
> daily basis.
>
> But shouldn't this be unnecessary ?
>
>
>
> Mit freundlichem Gruß
>
> Henrik Steffen
> Geschäftsführer
>
> top concepts Internetmarketing GmbH
> Am Steinkamp 7 - D-21684 Stade - Germany
> --------------------------------------------------------
> http://www.topconcepts.com          Tel. +49 4141 991230
> mail: steffen@topconcepts.com       Fax. +49 4141 991233
> --------------------------------------------------------
> 24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
> --------------------------------------------------------
> System-Partner gesucht: http://www.franchise.city-map.de
> --------------------------------------------------------
> Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
> --------------------------------------------------------
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hi,

thanks for this piece of information. Do you know, where and how I can see
which XID is the current? From this I could tell if there more than 2M
transactions a day.

I don't use any triggers. But on our heavily loaded website, there are about
60.000 pageviews on an average day, and each page view triggers maybe 10 -
15 database
lookups.

so we might get rather close to 2 M a day on certain days with more
pageviews.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
not only lookups, but also inserts, updates and deletions, of course...

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Bertin, Philippe" <philippe.bertin@barco.com>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, June 20, 2002 11:33 AM
Subject: RE: [GENERAL] Serious Crash last Friday


Hello,

If you're only having lookups, I still doubt that this kind of problem would
be able to occur (however, never be surprised of anything :)  How you can
see the current ID, I can't tell, but I think that was also in the same
text. I suggest you have a look into the official PostgreSQL- documentation,
probably the Admin's guide, or the User's guide, because I think it was
there I saw this explanation.

Kind regards,

Philippe.

> -----Original Message-----
> From: Henrik Steffen [SMTP:steffen@city-map.de]
> Sent: donderdag 20 juni 2002 11:29
> To: Bertin, Philippe
> Subject: Re: [GENERAL] Serious Crash last Friday
>
> Hi,
>
> thanks for this piece of information. Do you know, where and how I can see
> which XID is the current? From this I could tell if there more than 2M
> transactions
> a day.
>
> I don't use any triggers. But on our heavily loaded website, there are
> about
> 60.000 pageviews a day, and each page view triggers maybe 10 - 15 database
> lookups.
> so we might get rather close to 2 M a day on certain days.
>
> Mit freundlichem Gruß
>
> Henrik Steffen
> Geschäftsführer
>
> top concepts Internetmarketing GmbH
> Am Steinkamp 7 - D-21684 Stade - Germany
> --------------------------------------------------------
> http://www.topconcepts.com          Tel. +49 4141 991230
> mail: steffen@topconcepts.com       Fax. +49 4141 991233
> --------------------------------------------------------
> 24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
> --------------------------------------------------------
> System-Partner gesucht: http://www.franchise.city-map.de
> --------------------------------------------------------
> Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
> --------------------------------------------------------
>
> ----- Original Message -----
> From: "Bertin, Philippe" <philippe.bertin@barco.com>
> To: "Henrik Steffen" <steffen@city-map.de>
> Sent: Thursday, June 20, 2002 10:52 AM
> Subject: RE: [GENERAL] Serious Crash last Friday
>
>
> Henrik,
>
> Not that I know so much about it (I don't even remember if it's really
> called 'XID'- raparound). But it comes to something like : every action on
> the database gets an ID. This ID is contained in 32 bits, so 4M; only half
> of the range can be used, so 2M. Every 2M actions, the ID wraps around. If
> that happens, your data may get corrupt. A vacuum analyze resets these
> ID's
> ...
>
> I don't remember where I read this. I'll look around as from when I have
> time, and mail you back the reference for it. I think it'll be around this
> evening (Belgian time) when I'll find the time for doing so ... Maybe
> Martijn could give you some hints ?
>
> However, if you indicate that you're doing a vacuum analyze on a daily
> basis, I doubt that that can be the reason (unless you would have a lot of
> triggers on your tables that increment the number of actions exponentially
> by recursively calling each other ?)
>
> Kind regards,
>
> Philippe Bertin.
>
> > -----Original Message-----
> > From: Henrik Steffen [SMTP:steffen@city-map.de]
> > Sent: donderdag 20 juni 2002 9:34
> > To: Bertin, Philippe
> > Cc: pg
> > Subject: Re: [GENERAL] Serious Crash last Friday
> >
> >
> > Dear Philippe,
> >
> > I have always been using the VACUUM ANALYZE on a daily basis.
> >
> > I recently changed this cronjob into "VACUUM FULL ANALYZE" - which
> > didn't help either.
> >
> > What seems to be helpful is, to drop all user indexes and recreate them
> > on a daily basis (before, I did this only on a weekly basis once every
> > sunday)
> >
> > What exactly is XID wraparound? How can I make sure I am using it? Is it
> > anything
> > I shouldn't use? Or should I change anything to make my system run more
> > stable?
> >
> > Any help highly appreciated
> >
> > thanks
> >
> >
> > Mit freundlichem Gruß
> >
> > Henrik Steffen
> > Geschäftsführer
> >
> > top concepts Internetmarketing GmbH
> > Am Steinkamp 7 - D-21684 Stade - Germany
> > --------------------------------------------------------
> > http://www.topconcepts.com          Tel. +49 4141 991230
> > mail: steffen@topconcepts.com       Fax. +49 4141 991233
> > --------------------------------------------------------
> > 24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
> > --------------------------------------------------------
> > System-Partner gesucht: http://www.franchise.city-map.de
> > --------------------------------------------------------
> > Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
> > --------------------------------------------------------
> >
> > ----- Original Message -----
> > From: "Bertin, Philippe" <philippe.bertin@barco.com>
> > To: "Henrik Steffen" <steffen@city-map.de>
> > Sent: Thursday, June 20, 2002 9:06 AM
> > Subject: RE: [GENERAL] Serious Crash last Friday
> >
> >
> > Henrik,
> >
> > I think it was Martijn who asked at a certain point if you made sure
> there
> > was no ... (XID ?) wraparound. As it's a heavily used database, couldn't
> > this be the cause ? i.e., did you try since then to regularly perform
> the
> > vacuum analyze (as I see this, on a daily basis) ?
> >
> > Regards,
> >
> > Philippe Bertin
> > Software Development Engineer Avionics
> > ---------------------------------------------------
> >
> > > -----Original Message-----
> > > From: Henrik Steffen [SMTP:steffen@city-map.de]
> > > Sent: donderdag 20 juni 2002 8:54
> > > To: pg
> > > Subject: Re: [GENERAL] Serious Crash last Friday
> > >
> > >
> > > Hello all,
> > >
> > > unfortunately I didn't get a really helping answer from the list yet.
> > >
> > > Additionally yesterday night there was again a problem with some
> > SELECTs:
> > >
> > > NOTICE:  Message from PostgreSQL backend:
> > > The Postmaster has informed me that some other backend
> > > died abnormally and possibly corrupted shared memory.
> > > I have rolled back the current transaction and am
> > > going to terminate your database system connection and exit.
> > > Please reconnect to the database system and repeat your query.
> > > DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code
> > 7:
> > >  server closed the connection unexpectedly
> > > This probably means the server terminated abnormally
> > > before or while processing the request.
> > >
> > > Command was:
> > > SELECT name
> > > FROM regionen
> > > WHERE region='0119';
> > >  at /web/pm/CityMap/Abfragen.pm line 135
> > >
> > > This is really annoying.
> > >
> > > When I noticed it this morning, I dropped all indexes and recreated
> > them.
> > > Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query
> > > worked
> > > properly again.
> > >
> > > I have now created a cronjob that will drop and recreate all indexes
> on
> > a
> > > daily basis.
> > >
> > > But shouldn't this be unnecessary ?
> > >
> > >
> > >
> > > Mit freundlichem Gruß
> > >
> > > Henrik Steffen
> > > Geschäftsführer
> > >
> > > top concepts Internetmarketing GmbH
> > > Am Steinkamp 7 - D-21684 Stade - Germany
> > > --------------------------------------------------------
> > > http://www.topconcepts.com          Tel. +49 4141 991230
> > > mail: steffen@topconcepts.com       Fax. +49 4141 991233
> > > --------------------------------------------------------
> > > 24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
> > > --------------------------------------------------------
> > > System-Partner gesucht: http://www.franchise.city-map.de
> > > --------------------------------------------------------
> > > Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
> > > --------------------------------------------------------
> > >
> > >
> > >
> > > ---------------------------(end of
> broadcast)---------------------------
> > > TIP 5: Have you checked our extensive FAQ?
> > >
> > > http://www.postgresql.org/users-lounge/docs/faq.html


Re: Serious Crash last Friday

From
Stephan Szabo
Date:
On Thu, 20 Jun 2002, Henrik Steffen wrote:

>
> Hello all,
>
> unfortunately I didn't get a really helping answer from the list yet.
>
> Additionally yesterday night there was again a problem with some SELECTs:
>
> NOTICE:  Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
> died abnormally and possibly corrupted shared memory.
> I have rolled back the current transaction and am
> going to terminate your database system connection and exit.
> Please reconnect to the database system and repeat your query.
> DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
>  server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

As a question, what does the log say and did it leave a core in the
database directory that you can get a backtrace from?


Re: Serious Crash last Friday

From
Scott Marlowe
Date:
On Thu, 20 Jun 2002, Henrik Steffen wrote:

> Additionally yesterday night there was again a problem with some SELECTs:
>
> NOTICE:  Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
> died abnormally and possibly corrupted shared memory.
> I have rolled back the current transaction and am
> going to terminate your database system connection and exit.
> Please reconnect to the database system and repeat your query.
> DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
>  server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Look at that error message again.  It says SOME OTHER backend died
abnormally, and your query was terminated because of it.  I.e. the
following query was NOT the problem it simply couldn't run because some
other query caused a backend to abort.

>
> Command was:
> SELECT name
> FROM regionen
> WHERE region='0119';
>  at /web/pm/CityMap/Abfragen.pm line 135
>
> This is really annoying.

Yes it is.

> When I noticed it this morning, I dropped all indexes and recreated them.
> Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
> properly again.

It would likely have worked fine without all that, since it wasn't the
cause of the backend crash.

> I have now created a cronjob that will drop and recreate all indexes on a
> daily basis.
>
> But shouldn't this be unnecessary ?

Correct.  Someday, someone will step up to the plate and fix the problem
with btrees growing and growing and not reusing dead space.

Til then the solution is to reindex heavily updated indexes during nightly
maintenance.

A few questions.  Have you done any really heavy testing on your server to
make sure it has no problems with its hardware or anything?  I've seen
machines with memory errors or bad blocks on the hard drive slip into
production and wreak havoc due to slow corruption of a database.

Try compiling the linux kernel with a -j 10 switch (i.e. 10 seperate
threads, eats up tons of memory) and see if you get sig 11 errors.  Also,
check your hard drives for bad blocks as well (badblock is the command,
and it can run in a "save a block before write testing it then put the
data back in it" mode that lets you find all the bad blocks on your hard
drives.

Bad blocks are the primary reason I always try to run my database on RAID1
or RAID5 software raid as a minimum on Linux, since a bad block will cause
the affected drive to be marked offline, and not affect your data
integrity.

--
"Force has no place where there is need of skill.", "Haste in every
business brings failures.", "This is the bitterest pain among men, to have
much knowledge but no power." -- Herodotus



Re: Serious Crash last Friday

From
Alvaro Herrera
Date:
Scott Marlowe dijo:

> On Thu, 20 Jun 2002, Henrik Steffen wrote:

> > I have now created a cronjob that will drop and recreate all indexes on a
> > daily basis.
> >
> > But shouldn't this be unnecessary ?

[...]

> A few questions.  Have you done any really heavy testing on your server to
> make sure it has no problems with its hardware or anything?  I've seen
> machines with memory errors or bad blocks on the hard drive slip into
> production and wreak havoc due to slow corruption of a database.

Also, if you are on ix86 hardware, try running memtest86 for a few hours
on it (www.teresaudio.com/memtest86).

--
Alvaro Herrera (<alvherre[a]atentus.com>)
"La realidad se compone de muchos sueños, todos ellos diferentes,
pero en cierto aspecto, parecidos..." (Yo, hablando de sueños eróticos)


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hello all,

unfortunately I didn't get a really helping answer from the list yet.

Additionally yesterday night there was again a problem with some SELECTs:

NOTICE:  Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
DB-Error in /web/intern.city-map.de/www/vertrieb/wiedervorlage.pl Code 7:
 server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Command was:
SELECT name
FROM regionen
WHERE region='0119';
 at /web/pm/CityMap/Abfragen.pm line 135

This is really annoying.

When I noticed it this morning, I dropped all indexes and recreated them.
Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
properly again.

I have now created a cronjob that will drop and recreate all indexes on a
daily basis.
But shouldn't this be unnecessary ?

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Henrik Steffen" <steffen@city-map.de>
To: "Martijn van Oosterhout" <kleptog@svana.org>
Cc: "pg" <pgsql-general@postgresql.org>
Sent: Tuesday, June 18, 2002 11:15 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> well, I don't work with transactions at all...
>
> and I can't see no dupes...
>
>
> Mit freundlichem Gruß
>
> Henrik Steffen
> Geschäftsführer
>
> top concepts Internetmarketing GmbH
> Am Steinkamp 7 - D-21684 Stade - Germany
> --------------------------------------------------------
> http://www.topconcepts.com          Tel. +49 4141 991230
> mail: steffen@topconcepts.com       Fax. +49 4141 991233
> --------------------------------------------------------
> 24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
> --------------------------------------------------------
> System-Partner gesucht: http://www.franchise.city-map.de
> --------------------------------------------------------
> Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
> --------------------------------------------------------
>
> ----- Original Message -----
> From: "Martijn van Oosterhout" <kleptog@svana.org>
> To: "Henrik Steffen" <steffen@city-map.de>
> Sent: Tuesday, June 18, 2002 1:25 AM
> Subject: Re: [GENERAL] Serious Crash last Friday
>
>
> > On Mon, Jun 17, 2002 at 05:28:23PM +0200, Henrik Steffen wrote:
> > > yes, I was doing it on a living database... not good?
> > >
> > > I was not using -a
> >
> > It doesn't really matter if you're doing it on a live database, it just
> > means that it won't see stuff that is in WAL that has not been committed
> to
> > the main storage.
> >
> > If you're not using -a then something else is going on. Do you have many
> > uncomitted transactions? The program is not particularly intelligent
about
> > which rows are currently active, so you will probably see some
duplicates.
> > Look at the oids shown on the right of each tuple. Also, I've just
> uploaded
> > a new version which deals with dates and times much better.
> >
> > --
> > Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > > There are 10 kinds of people in the world, those that can do binary
> > > arithmetic and those that can't.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
ok, but I don't use any explicit transactions using begin, end, commit or
rollback

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Andrew Sullivan" <andrew@libertyrms.info>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Tuesday, June 18, 2002 4:34 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Tue, Jun 18, 2002 at 11:15:04AM +0200, Henrik Steffen wrote:
> > well, I don't work with transactions at all...
>
> Yes you do.  Everything in Postres is in a transaction.  So if
> someone hasn't finished something, you have an uncommitted
> transaction.
>
> A
>
> --
> ----
> Andrew Sullivan                               87 Mowat Avenue
> Liberty RMS                           Toronto, Ontario Canada
> <andrew@libertyrms.info>                              M6K 3E3
>                                          +1 416 646 3304 x110
>


Re: Serious Crash last Friday

From
Andrew Sullivan
Date:
On Wed, Jun 19, 2002 at 11:35:19AM +0200, Henrik Steffen wrote:
> Hello all,
>
> unfortunately I didn't get a really helping answer from the list yet.

We are having trouble helping, because you are not providing the
information that people have asked you for.

> Additionally yesterday night there was again a problem with some SELECTs:
>
> NOTICE:  Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend

As someone else already told you, this is no help because it's a
message about some _other_ back end.  You need to turn on debugging.
Find out where the problem is:

-Track the PIDs
-When this happens, find the PID that is causing the corruption
-Then you can trace further back in your log file to find the query
that was running when it happened.

If the backend is dying abnormally, it should also leave a core file
around, on which you can use a debugger.

> This is really annoying.

But no-one can help you if you don't follow the proffered advice for
how to help.

> When I noticed it this morning, I dropped all indexes and recreated them.
> Then I ran a VACUUM FULL VERBOSE ANALYZE - afterwards the same query worked
> properly again.

The query in question is _not_ the problem.  The error message tells
you so: some _other_ back end is dying.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hello,

I am back from vacation and still experiencing strange behaviour of my database:

Dumping all to /var/lib/pgsql/backup/db-backup-20020709.out....
connected to template1...
dumping database "kunden"...
pg_dump: ERROR:  cannot open segment 1 of relation pg_toast_16584 (target block 1048595): No such file or directory
pg_dump: lost synchronization with server, resetting connection
pg_dump: SQL command to dump the contents of table "seiten" failed: PQendcopy() failed.
pg_dump: Error message from server: pg_dump: The command was: COPY "seiten" TO stdout;
pg_dump failed on kunden, exiting

This is my daily dump programm for backup reasons... tonight it didn't succeed,
and I don't understand why.

anyone ever had pg_toast for breakfast ?

any help appreciated

thanks

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------


Attachment

Re: Serious Crash last Friday

From
Andrew Sullivan
Date:
On Tue, Jul 09, 2002 at 05:33:58PM +0200, Henrik Steffen wrote:

> pg_dump: ERROR:  cannot open segment 1 of relation pg_toast_16584
> (target block 1048595): No such file or directory

Last time I saw a similarly unhappy message, I had a fried hard
drive.  Are you sure you don't have hardware trouble?  Because that's
the first thing I'd look for.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
hello,

I allready have a script that drops and recreates all user-indexes on a
daily basis using DROP and CREATE INDEX.

Now I started the database in single user mode and did a
REINDEX DATABASE kunden FORCE, but this didn't help either.

I will do an fsck tonight to see if something with the hardware is wrong.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
Andrew Sullivan
Date:
On Wed, Jul 10, 2002 at 08:44:25AM +0200, Henrik Steffen wrote:
>
> I will do an fsck tonight to see if something with the hardware is wrong.

That won't guarantee you don't have bad hardware.  Are you seeing
anything in the syslog?  If you do a non-destructive badblocks (or
whatever) test, what happens?  I'd expect that, at least, to show up
any hardware problem.

A
--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
excuse me, but what is a "non-destructive badblocks (or
whatever) test" - and how can I do this?

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Andrew Sullivan" <andrew@libertyrms.info>
To: "pg" <pgsql-general@postgresql.org>
Sent: Wednesday, July 10, 2002 2:20 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Wed, Jul 10, 2002 at 08:44:25AM +0200, Henrik Steffen wrote:
> >
> > I will do an fsck tonight to see if something with the hardware is
wrong.
>
> That won't guarantee you don't have bad hardware.  Are you seeing
> anything in the syslog?  If you do a non-destructive badblocks (or
> whatever) test, what happens?  I'd expect that, at least, to show up
> any hardware problem.
>
> A
> --
> ----
> Andrew Sullivan                               87 Mowat Avenue
> Liberty RMS                           Toronto, Ontario Canada
> <andrew@libertyrms.info>                              M6K 3E3
>                                          +1 416 646 3304 x110
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Serious Crash last Friday

From
Martijn van Oosterhout
Date:
On Wed, Jul 10, 2002 at 02:51:00PM +0200, Henrik Steffen wrote:
> excuse me, but what is a "non-destructive badblocks (or
> whatever) test" - and how can I do this?

There is a badblocks program which will check for badblocks. I'd also
recommend memtest86 (www.memtest86.com).
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Serious Crash last Friday

From
Andrew Sullivan
Date:
On Wed, Jul 10, 2002 at 02:51:00PM +0200, Henrik Steffen wrote:
> excuse me, but what is a "non-destructive badblocks (or
> whatever) test" - and how can I do this?

If you're using linux, try "man badblocks".

On Solaris, a similar command is diskscan.

Other systems will have other names.  I can't recall how to do this
on BSD, and I can't find anything in my local man db at the moment.
I'll bet someone else here knows.

You should check for other hardware faults, too.  A bad memory module
can provide for all sorts of strange errors on your system (although
if it never locks up or crashes, I'd look elsewhere for the problem
first).  In recent releases, PostgreSQL has proven to be very stable,
but you are having a lot of trouble, and many others are not
reporting similar problems.  That leads one to suspect that you have
faulty hardware; it certainly requires, at least, that you eliminate
hardware as a factor.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hi,

thanks for the information...

the badblocks read-only test did not report any problems,
do you think i should run the "read-write" test, too?

i did the last initdb 3 weeks ago and created every table new from dumps.
i am recreating all user-indexes every day, i am vacuuming everything
everyday.

the server has only been running for 4 months, it's brand new hardware
(Intel PIII,
900 MHz, 2 x 60 GB SCSI-Raid 0 disks) ... it has been up and running for 42
days
without reset now (last reset was due to work at powerswitch).

tonight I will have the memory checked by memtest86 ...


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Andrew Sullivan" <andrew@libertyrms.info>
To: "pg" <pgsql-general@postgresql.org>
Sent: Wednesday, July 10, 2002 4:38 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> On Wed, Jul 10, 2002 at 02:51:00PM +0200, Henrik Steffen wrote:
> > excuse me, but what is a "non-destructive badblocks (or
> > whatever) test" - and how can I do this?
>
> If you're using linux, try "man badblocks".
>
> On Solaris, a similar command is diskscan.
>
> Other systems will have other names.  I can't recall how to do this
> on BSD, and I can't find anything in my local man db at the moment.
> I'll bet someone else here knows.
>
> You should check for other hardware faults, too.  A bad memory module
> can provide for all sorts of strange errors on your system (although
> if it never locks up or crashes, I'd look elsewhere for the problem
> first).  In recent releases, PostgreSQL has proven to be very stable,
> but you are having a lot of trouble, and many others are not
> reporting similar problems.  That leads one to suspect that you have
> faulty hardware; it certainly requires, at least, that you eliminate
> hardware as a factor.
>
> A
>
> --
> ----
> Andrew Sullivan                               87 Mowat Avenue
> Liberty RMS                           Toronto, Ontario Canada
> <andrew@libertyrms.info>                              M6K 3E3
>                                          +1 416 646 3304 x110
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly


Re: Serious Crash last Friday

From
Jan Wieck
Date:
Henrik Steffen wrote:
>
> Hi,
>
> thanks for the information...
>
> [...]

Did you have any OS crash or the like since the last rebuild of the
database? Are there any files in lost+found of the filesystem, your data
directory is on?

The fact that the backend cannot open a segment of a TOAST table raises
the question "how did it get removed?".


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
ok -- now this is what I did now:

I tried to figure out, where exactly the problem was:

The error occured while trying to COPY table seiten to STDOUT

Now I did "SELECT * FROM seiten;" ---> same problem

table seiten looks as follows:

lfdseitenr char(9)
absatznr smallint
absatz text
bildtyp smallint
bildtext text
richt boolean

there are 22409 rows - and there is an index on lfdseitenr

using LIKE I tried to figure out which rows where affected

SELECT * FROM seiten WHERE lfdseitenr LIKE '08%';

finally I found out that it was only ONE single row,
lfdseitenr = 081400023 AND absatznr=3

SELECT lfdseitenr, absatznr, bildtyp, bildtext, richt FROM seiten WHERE
lfdseitenr = '081400023' AND absatznr=3;

so only ONE field in ONE particular row was destroyed (pg_toast - error
message -
no such file or directory)

I now created a temp table selecting EVERYTHING but the affected row and
renamed
the table. So the problem is solved now, but this should actually never
happen.

I have the temp-table left on my system. Is it possible that someone entered
invalid
characters or something into this particular text-fild "absatz" ? Or what
else could
cause this error? Maybe anyone of the postgres-'gurus' wants to have a look
on my
system? Maybe there is a bug anywhere in postgres? Doesn't look like a
hardware-problem
or what do you guys think?


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Henrik Steffen" <steffen@city-map.de>
To: "Andrew Sullivan" <andrew@libertyrms.info>
Cc: "pg" <pgsql-general@postgresql.org>
Sent: Wednesday, July 10, 2002 5:19 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> Hi,
>
> thanks for the information...
>
> the badblocks read-only test did not report any problems,
> do you think i should run the "read-write" test, too?
>
> i did the last initdb 3 weeks ago and created every table new from dumps.
> i am recreating all user-indexes every day, i am vacuuming everything
> everyday.
>
> the server has only been running for 4 months, it's brand new hardware
> (Intel PIII,
> 900 MHz, 2 x 60 GB SCSI-Raid 0 disks) ... it has been up and running for
42
> days
> without reset now (last reset was due to work at powerswitch).
>
> tonight I will have the memory checked by memtest86 ...
>
>
> Mit freundlichem Gruß
>
> Henrik Steffen
> Geschäftsführer
>
> top concepts Internetmarketing GmbH
> Am Steinkamp 7 - D-21684 Stade - Germany
> --------------------------------------------------------
> http://www.topconcepts.com          Tel. +49 4141 991230
> mail: steffen@topconcepts.com       Fax. +49 4141 991233
> --------------------------------------------------------
> 24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
> --------------------------------------------------------
> System-Partner gesucht: http://www.franchise.city-map.de
> --------------------------------------------------------
> Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
> --------------------------------------------------------
>
> ----- Original Message -----
> From: "Andrew Sullivan" <andrew@libertyrms.info>
> To: "pg" <pgsql-general@postgresql.org>
> Sent: Wednesday, July 10, 2002 4:38 PM
> Subject: Re: [GENERAL] Serious Crash last Friday
>
>
> > On Wed, Jul 10, 2002 at 02:51:00PM +0200, Henrik Steffen wrote:
> > > excuse me, but what is a "non-destructive badblocks (or
> > > whatever) test" - and how can I do this?
> >
> > If you're using linux, try "man badblocks".
> >
> > On Solaris, a similar command is diskscan.
> >
> > Other systems will have other names.  I can't recall how to do this
> > on BSD, and I can't find anything in my local man db at the moment.
> > I'll bet someone else here knows.
> >
> > You should check for other hardware faults, too.  A bad memory module
> > can provide for all sorts of strange errors on your system (although
> > if it never locks up or crashes, I'd look elsewhere for the problem
> > first).  In recent releases, PostgreSQL has proven to be very stable,
> > but you are having a lot of trouble, and many others are not
> > reporting similar problems.  That leads one to suspect that you have
> > faulty hardware; it certainly requires, at least, that you eliminate
> > hardware as a factor.
> >
> > A
> >
> > --
> > ----
> > Andrew Sullivan                               87 Mowat Avenue
> > Liberty RMS                           Toronto, Ontario Canada
> > <andrew@libertyrms.info>                              M6K 3E3
> >                                          +1 416 646 3304 x110
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 3: if posting/reading through Usenet, please send an appropriate
> > subscribe-nomail command to majordomo@postgresql.org so that your
> > message can get through to the mailing list cleanly
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
no, no crash for more than 42 days

last initdb was 3 weeks ago

all lost+found folders on the system are empty.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Jan Wieck" <JanWieck@Yahoo.com>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: "Andrew Sullivan" <andrew@libertyrms.info>; "pg"
<pgsql-general@postgresql.org>
Sent: Wednesday, July 10, 2002 5:40 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> Henrik Steffen wrote:
> >
> > Hi,
> >
> > thanks for the information...
> >
> > [...]
>
> Did you have any OS crash or the like since the last rebuild of the
> database? Are there any files in lost+found of the filesystem, your data
> directory is on?
>
> The fact that the backend cannot open a segment of a TOAST table raises
> the question "how did it get removed?".
>
>
> Jan
>
> --
>
> #======================================================================#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me.                                  #
> #================================================== JanWieck@Yahoo.com #
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


Re: Serious Crash last Friday

From
Andrew Sullivan
Date:
On Wed, Jul 10, 2002 at 05:19:47PM +0200, Henrik Steffen wrote:
> Hi,
>
> thanks for the information...
>
> the badblocks read-only test did not report any problems,
> do you think i should run the "read-write" test, too?

Well, if you do it'll destoy the data, so although it's the only way
to be sure, I wouldn't unless absolutely pushed to do so.  A
read-write badblocks test on a big partition can take many hours.

> tonight I will have the memory checked by memtest86 ...

Yes, that seems a good idea.  Brand new hardware doesn't guarantee
anything, particularly when memory is so fast these days (I've had
DIMMs fail a couple of months after they were new).

A


--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Serious Crash last Friday

From
Tom Lane
Date:
Jan Wieck <JanWieck@Yahoo.com> writes:
> The fact that the backend cannot open a segment of a TOAST table raises
> the question "how did it get removed?".

I am guessing that the segment never existed, and that the real problem
is corruption of an entry in the TOAST table's index.  A bad record
number in an index tuple's heap pointer could produce this symptom.

It might be worth dumping the TOAST index with pg_filedump or some such
tool and looking for silly block numbers.  Examining the pattern of
corruption on the bad page, once found, might give us some hint how it
happened.

            regards, tom lane

Re: Serious Crash last Friday

From
"scott.marlowe"
Date:
On Wed, 10 Jul 2002, Andrew Sullivan wrote:

> On Wed, Jul 10, 2002 at 05:19:47PM +0200, Henrik Steffen wrote:
> > Hi,
> >
> > thanks for the information...
> >
> > the badblocks read-only test did not report any problems,
> > do you think i should run the "read-write" test, too?
>
> Well, if you do it'll destoy the data, so although it's the only way
> to be sure, I wouldn't unless absolutely pushed to do so.  A
> read-write badblocks test on a big partition can take many hours.

This isn't entirely true.  According to bad blocks' man page:

-n     Use  non-destructive  read-write  mode.  By default
       only a  non-destructive  read-only  test  is  done.
       This  option  must  not  be  combined  with  the -w
       option, as they are mutually exclusive.

So, with the -n switch, badblocks will save a sector, do a write / read
test, then restore the sector.

Note that this is pretty slow, as I've tested it before.

> > tonight I will have the memory checked by memtest86 ...
>
> Yes, that seems a good idea.  Brand new hardware doesn't guarantee
> anything, particularly when memory is so fast these days (I've had
> DIMMs fail a couple of months after they were new).

Also, another REALLY good test for bad memory is to build postgresql from
source a couple dozen times, especially with a -j switch set to about 6 or
so.


Re: Serious Crash last Friday

From
Andrew Sullivan
Date:
On Wed, Jul 10, 2002 at 02:16:53PM -0600, scott.marlowe wrote:
> This isn't entirely true.  According to bad blocks' man page:
>
> -n     Use  non-destructive  read-write  mode.  By default

Ah, yes, I forgot about that.  Takes days ;-)

> Also, another REALLY good test for bad memory is to build postgresql from
> source a couple dozen times, especially with a -j switch set to about 6 or
> so.

I've had extremely good luck with memtestx86, though.  Usually, by
the 3rd or 4th test it's spotted something, if something is wrong.

This raises another issue, by the way.  Even big, expensive hardware
can have faulty memory.  But the big, expensive hardware frequently
has ECC RAM, which saves your hide.  I _know_ it's expensive, but
it's worth every penny.  If you're building a database server for
production use, and you haven't specified ECC memory, go back and
think again.  It might save you hours of work some day.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Serious Crash last Friday

From
Bruce Momjian
Date:
Andrew Sullivan wrote:
> On Wed, Jul 10, 2002 at 02:51:00PM +0200, Henrik Steffen wrote:
> > excuse me, but what is a "non-destructive badblocks (or
> > whatever) test" - and how can I do this?
>
> If you're using linux, try "man badblocks".
>
> On Solaris, a similar command is diskscan.
>
> Other systems will have other names.  I can't recall how to do this
> on BSD, and I can't find anything in my local man db at the moment.
> I'll bet someone else here knows.

My Buslogic/Mylex SCSI controller card beeps when it hits a bad SCSI
block.  I didn't know what that sound was until about the 10th time.
:-)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hello Tom,

I now did a "pg_filedump -R 1048595 16584" on the file
/var/lib/pgsql/data/base/16556/16584

It delivered the following output. Can you read anything from it?

*****************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: 16584
* Options used: -R 1048595
*
* Dump created on: Thu Jul 11 08:13:47 2002
*****************************************************************

Block 1048595 ******************************************************
<Header> -----
 Block Offset: 0x00026000         Offsets: Lower     156 (0x009c)
 Block Size: 8192                          Upper     200 (0x00c8)
 LSN:  logid      0 recoff 0x00982b4c      Special  8192 (0x2000)
 Items:   34                   Free Space:   44
 Length (including item array): 160

<Data> ------
 Item   1 -- Length:  613  Offset: 7576 (0x1d98)  Flags: USED
 Item   2 -- Length:   61  Offset: 7512 (0x1d58)  Flags: USED
 Item   3 -- Length:   83  Offset: 7428 (0x1d04)  Flags: USED
 Item   4 -- Length:  109  Offset: 7316 (0x1c94)  Flags: USED
 Item   5 -- Length:  405  Offset: 6908 (0x1afc)  Flags: USED
 Item   6 -- Length:   61  Offset: 6844 (0x1abc)  Flags: USED
 Item   7 -- Length:  397  Offset: 6444 (0x192c)  Flags: USED
 Item   8 -- Length:   61  Offset: 6380 (0x18ec)  Flags: USED
 Item   9 -- Length:  729  Offset: 5648 (0x1610)  Flags: USED
 Item  10 -- Length:   61  Offset: 5584 (0x15d0)  Flags: USED
 Item  11 -- Length:  881  Offset: 4700 (0x125c)  Flags: USED
 Item  12 -- Length:  457  Offset: 4240 (0x1090)  Flags: USED
 Item  13 -- Length:   61  Offset: 4176 (0x1050)  Flags: USED
 Item  14 -- Length:  345  Offset: 3828 (0x0ef4)  Flags: USED
 Item  15 -- Length:   61  Offset: 3764 (0x0eb4)  Flags: USED
 Item  16 -- Length:  329  Offset: 3432 (0x0d68)  Flags: USED
 Item  17 -- Length:   61  Offset: 3368 (0x0d28)  Flags: USED
 Item  18 -- Length:  457  Offset: 2908 (0x0b5c)  Flags: USED
 Item  19 -- Length:   61  Offset: 2844 (0x0b1c)  Flags: USED
 Item  20 -- Length:  485  Offset: 2356 (0x0934)  Flags: USED
 Item  21 -- Length:   61  Offset: 2292 (0x08f4)  Flags: USED
 Item  22 -- Length:   61  Offset: 2228 (0x08b4)  Flags: USED
 Item  23 -- Length:   61  Offset: 2164 (0x0874)  Flags: USED
 Item  24 -- Length:   61  Offset: 2100 (0x0834)  Flags: USED
 Item  25 -- Length:  321  Offset: 1776 (0x06f0)  Flags: USED
 Item  26 -- Length:   61  Offset: 1712 (0x06b0)  Flags: USED
 Item  27 -- Length:  168  Offset: 1544 (0x0608)  Flags: USED
 Item  28 -- Length:   88  Offset: 1456 (0x05b0)  Flags: USED
 Item  29 -- Length:   97  Offset: 1356 (0x054c)  Flags: USED
 Item  30 -- Length:   61  Offset: 1292 (0x050c)  Flags: USED
 Item  31 -- Length:  157  Offset: 1132 (0x046c)  Flags: USED
 Item  32 -- Length:  801  Offset:  328 (0x0148)  Flags: USED
 Item  33 -- Length:   61  Offset:  264 (0x0108)  Flags: USED
 Item  34 -- Length:   61  Offset:  200 (0x00c8)  Flags: USED


*** End of Requested Range Encountered. Last Block Read: 1048595 ***

I still get the following error, even on the old renamed temp-table:
ERROR:  cannot open segment 1 of relation pg_toast_16584 (target block
1048595): No such file or directory

Why is that? The file 16584 exists. The target block exists, though it seems
to be the last
existing block in this particular file.

I even tried the -d option on pg_filedump ... it delivers a 36 MB file...
I can send it to you privately if you want.

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
Hi once more,

I found the damaged section in the >30 MB file:

  0940: d0930100 00000000 00000000 02000000  ................
  0950: 00000000 00005b02 10000600 0e092000  ......[....... .
  0960: 0d000000 30383134 30303032 33000300  ....081400023...
  0970: 140000c0 f3170000 a10f0000 d1930100  ................
  0980: ca400000 0000726f 04000000 01000000  .@....ro........
  0990: cf930100 00000000 00000000 02000000  ................

this is table "altseiten" where lfdseitenr=081400023 and absatznr=3
what's wrong here?

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
Tom Lane
Date:
"Henrik Steffen" <steffen@city-map.de> writes:
> I now did a "pg_filedump -R 1048595 16584" on the file
> /var/lib/pgsql/data/base/16556/16584

What file is that?  Doesn't look like it is an index ...

> It delivered the following output. Can you read anything from it?

Not a lot.  I'd suggest "pg_filedump -i FILENAME" where FILENAME is the
OID of the toast-table index for your problem table.  (Look at
pg_class.reltoastidxid if you're not sure.)  That should produce a ton
of output along the lines of

 Item 155 -- Length:   12  Offset: 4720 (0x1270)  Flags: USED
  Block Id: 4  linp Index: 39  Size: 12
  Has Nulls: 0   Has Varlenas: 0

What you want to look for is outrageously large values in the "Block Id"
field.  Once you find 'em, a "pg_filedump -i -f" of just the block
containing the broken item(s) would be worth studying.

            regards, tom lane

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
16584 was mentioned in the error message
ERROR:  cannot open segment 1 of relation pg_toast_16584

so I did a find in the /var/lib/pgsql folder and found only this particular
file.
I would say, it contains table information/content of the table 'altseiten'
with
the corrupted field.

select reltoastidxid from pg_class where relname='altseiten'
delivers: 0    ???


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------



Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
doing pg_filedump -i FILENAME on the 16584 file I have all Block ids from 0
to 1150


select * from pg_class where oid = 16584
-> relname altseiten
-> reltoastidxid 0

yes, you are right 16584 is the oid of the table altseiten, but the
reltoastidxid is
still 0.


Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: "pg" <pgsql-general@postgresql.org>
Sent: Thursday, July 11, 2002 8:36 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> "Henrik Steffen" <steffen@city-map.de> writes:
> > I now did a "pg_filedump -R 1048595 16584" on the file
> > /var/lib/pgsql/data/base/16556/16584
>
> What file is that?  Doesn't look like it is an index ...
>
> > It delivered the following output. Can you read anything from it?
>
> Not a lot.  I'd suggest "pg_filedump -i FILENAME" where FILENAME is the
> OID of the toast-table index for your problem table.  (Look at
> pg_class.reltoastidxid if you're not sure.)  That should produce a ton
> of output along the lines of
>
>  Item 155 -- Length:   12  Offset: 4720 (0x1270)  Flags: USED
>   Block Id: 4  linp Index: 39  Size: 12
>   Has Nulls: 0   Has Varlenas: 0
>
> What you want to look for is outrageously large values in the "Block Id"
> field.  Once you find 'em, a "pg_filedump -i -f" of just the block
> containing the broken item(s) would be worth studying.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html


Re: Serious Crash last Friday

From
Tom Lane
Date:
"Henrik Steffen" <steffen@city-map.de> writes:
> yes, you are right 16584 is the oid of the table altseiten, but the
> reltoastidxid is still 0.

Oh, my mistake --- you have to look at the toast table identified by
reltoastrelid, and then its reltoastidxid fingers the index.  For
example:

test72=# create table foo (t text);
CREATE
test72=# select oid,relname,relkind,reltoastrelid,reltoastidxid from pg_class order by oid desc limit 5;
  oid   |       relname       | relkind | reltoastrelid | reltoastidxid
--------+---------------------+---------+---------------+---------------
 812033 | pg_toast_812029_idx | i       |             0 |             0
 812031 | pg_toast_812029     | t       |             0 |        812033
 812029 | foo                 | r       |        812031 |             0
 812026 | manufacturer_z_key  | i       |             0 |             0
 812024 | manufacturer        | r       |             0 |             0
(5 rows)

Here, foo is the base table, pg_toast_812029 is its toast table,
pg_toast_812029_idx is the index.

            regards, tom lane

Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
ok, I now found out that the table altseiten with the oid 16584
has got the reltoastrelid 16586. this is the oid of pg_toast_16584 which
has the reltoastidxid 16588. this again is the oid of pg_toast_16584_idx.

now doing pg_filedump -i 16588 gives 1213 lines, and in Item 109 there is a
block id: 1048595

Then Item 278 has got Block Id 43, Item 279 Block Id 1, Item 280 Block Id 1,
Item 281 Block Id 52 ... looks strange, too, doesn't it?

what does this mean?

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Henrik Steffen" <steffen@city-map.de>
Cc: "pg" <pgsql-general@postgresql.org>
Sent: Thursday, July 11, 2002 9:00 AM
Subject: Re: [GENERAL] Serious Crash last Friday


> "Henrik Steffen" <steffen@city-map.de> writes:
> > yes, you are right 16584 is the oid of the table altseiten, but the
> > reltoastidxid is still 0.
>
> Oh, my mistake --- you have to look at the toast table identified by
> reltoastrelid, and then its reltoastidxid fingers the index.  For
> example:
>
> test72=# create table foo (t text);
> CREATE
> test72=# select oid,relname,relkind,reltoastrelid,reltoastidxid from
pg_class order by oid desc limit 5;
>   oid   |       relname       | relkind | reltoastrelid | reltoastidxid
> --------+---------------------+---------+---------------+---------------
>  812033 | pg_toast_812029_idx | i       |             0 |             0
>  812031 | pg_toast_812029     | t       |             0 |        812033
>  812029 | foo                 | r       |        812031 |             0
>  812026 | manufacturer_z_key  | i       |             0 |             0
>  812024 | manufacturer        | r       |             0 |             0
> (5 rows)
>
> Here, foo is the base table, pg_toast_812029 is its toast table,
> pg_toast_812029_idx is the index.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org


Re: Serious Crash last Friday

From
"Henrik Steffen"
Date:
aha! no, i didn't check the mem yet. but it seems like it's worth to
buy new ECC-RAM chips, if only ONE SINGLE FAULTY BIT can cause such
trouble!!

thanks for the detailed examination!

Mit freundlichem Gruß

Henrik Steffen
Geschäftsführer

top concepts Internetmarketing GmbH
Am Steinkamp 7 - D-21684 Stade - Germany
--------------------------------------------------------
http://www.topconcepts.com          Tel. +49 4141 991230
mail: steffen@topconcepts.com       Fax. +49 4141 991233
--------------------------------------------------------
24h-Support Hotline:  +49 1908 34697 (EUR 1.86/Min,topc)
--------------------------------------------------------
System-Partner gesucht: http://www.franchise.city-map.de
--------------------------------------------------------
Handelsregister: AG Stade HRB 5811 - UstId: DE 213645563
--------------------------------------------------------

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Henrik Steffen" <steffen@city-map.de>
Sent: Thursday, July 11, 2002 3:50 PM
Subject: Re: [GENERAL] Serious Crash last Friday


> oh, ignore previous message --- it's still too early in the morning.
>
> What you seem to have here is a single bit flip:
>
>  Item 109 -- Length:   16  Offset: 6432 (0x1920)  Flags: USED
>   Block Id: 1048595  linp Index: 3  Size: 16
>   Has Nulls: 0   Has Varlenas: 0
>
>   1920: 10001300 03001000 d1930100 02000000  ................
>
> Judging from the surrounding items, the block id in this entry should
> have been 19 (hex 00000013) ... but it's actually 00100013.  Everything
> else looks fine on the whole page.
>
> My bet is that you have a bad RAM chip that allowed the bit to flip
> while the index page was in memory.  Single-bit changes aren't usually
> what I'd expect from either software bugs or disk-related hardware
> failures.
>
> Have you done anything yet with that memory test program that some other
> folks recommended?
>
> regards, tom lane


Re: Serious Crash last Friday

From
Tom Lane
Date:
"Henrik Steffen" <steffen@city-map.de> writes:
> now doing pg_filedump -i 16588 gives 1213 lines, and in Item 109 there is a
> block id: 1048595

Ah-hah.  Would you send me "pg_filedump -i -f" output for that file?

            regards, tom lane

Re: Serious Crash last Friday

From
fetchmail@datas-world.dyndns.org
Date:
Comments: In-reply-to "Henrik Steffen" <steffen@city-map.de>
    message dated "Thu, 11 Jul 2002 11:09:34 +0200"
Date: Thu, 11 Jul 2002 09:23:29 -0400
Message-ID: <16326.1026393809@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Precedence: bulk
Sender: pgsql-general-owner@postgresql.org

"Henrik Steffen" <steffen@city-map.de> writes:
> now doing pg_filedump -i 16588 gives 1213 lines, and in Item 109 there is a
> block id: 1048595

Ah-hah.  Would you send me "pg_filedump -i -f" output for that file?

            regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster