Thread: BUG #3484: Missing pg_clog file / corrupt index
The following bug has been logged online:

Bug reference:      3484
Logged by:          Marc Schablewski
Email address:      ms@clickware.de
PostgreSQL version: 8.1.8
Operating system:   SuSE Linux 10.0 / Kernel 2.6.13-15.8-smp (x86-64)
Description:        Missing pg_clog file / corrupt index
Details:

Our application receives and processes payment information that comes in plain text files and stores the processed data into different tables in our database. There are about 5 tables involved, three of them with 35 million records so far. We get approximately 150000 payments a day. Each payment is handled in a single transaction, because in case of an error we want to store as many payments as possible. We have about 500000 INSERT and a few UPDATE statements each day. The whole application runs on two servers (see specs below) which are nearly identical. One is the production system, the other is for testing.

A few months ago we had some trouble with the test system. Postgres complained about a missing pg_clog file during the nightly routine VACUUM/ANALYZE. Some days later, the same error occurred on the production system, even on the same table! The corrupted table is one of the bigger ones involved in the file processing. After searching the web we found a hint that this problem could be related to a bug in 8.1.3, so we upgraded to 8.1.8 and restored the databases on both servers. This was around May, but now we have discovered the same problem on our production system again. Actually, the error occurred four weeks ago, but it was not discovered until now – if you do error logging, you should look at it from time to time ;)

When trying to back up or vacuum the database, we receive one of the following error messages:

**BACKUP**
pg_dump: ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
pg_dump: SQL command to dump the contents of table "dateiblock" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
pg_dump: The command was: COPY public.dateiblock (id, eda_id, dbt_id, lfdnr, binaer_offset, laenge) TO stdout;

**VACUUM**
INFO: vacuuming "public.dateiblock"
ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden

(For those not familiar with the German language: 'Datei oder Verzeichnis nicht gefunden' means 'file or directory not found'.)

Current pg_clogs range from 005A to 008F. The oldest one is dated April 30th.

We narrowed it down to a few records in that table. Some records contain unreasonable values, others produce the same message about the missing pg_clog file when selected, and some are simply missing. But they must have existed, because there are still records in a second table referencing them. One strange thing about this is that the referencing records are about two and a half months old and shouldn't have been touched since then.

We don't think this is a hardware issue, because we had it on two different servers and within a short period of time.

Luckily, the loss of data is minimal. There are only about 30 records affected. Otherwise this would have been fatal, because, as said before, our backup was not working either.

In addition, we had another problem today. One of the indexes on a second table became corrupted, causing the database backend to roll back all transactions and restart the processes.
A REINDEX fixed it, but it leaves me with a bad feeling about what will break next.

2007-07-25 08:07:00 CEST PANIC: right sibling's left-link doesn't match
2007-07-25 08:07:00 CEST STATEMENT: insert into transaktion (kaz_id, dbl_id, sta_id, kss_id, summe, zeitpunkt, tracenr, terminalid, status, kartennr, wae_id, kassenschnitt, freigabe, flo_id, importdatum) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15)
2007-07-25 08:07:00 CEST LOG: server process (PID 5699) was terminated by signal 6
2007-07-25 08:07:00 CEST LOG: terminating any other active server processes
2007-07-25 08:07:00 CEST WARNING: terminating connection because of crash of another server process
2007-07-25 08:07:00 CEST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Kind regards,

Marc Schablewski

System:

OS: SUSE LINUX 10.0 (x86-64), 2.6.13-15.8-smp x86_64
System: 2x Intel(R) Xeon(TM) CPU 2.80GHz Dual Core, 4GB RAM (HP ProLiant server)
3Ware 9500S-4LP, 2x RAID1 (one for OS & one for database)

Postgres 8.1.8

Postgres was compiled manually but with default parameters. The configuration has been tuned to improve performance.

shared_buffers = 30000
maintenance_work_mem = 131072
max_fsm_pages = 500000
default_statistics_target = 200
redirect_stderr = on
log_directory = '/var/log/pg_log'
log_min_duration_statement = 10
log_duration = on
log_line_prefix = '%t '
stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = off

VACUUM ANALYZE runs once a day, auto_vacuum is off.
On Jul 25, 2007, at 4:02 PM, Marc Schablewski wrote:
> [...]

Actually, this does sound like a hardware problem to me. You only have 5 tables that get hit heavily, so you've likely got somewhere around a 20% chance that corruption would hit the same table on two different machines.

So far you haven't said anything that sounds unusual about how you're using the database, and the hardware certainly seems pretty common-place, so I'm rather doubtful that it's software. But if you still have copies of the bad database, someone might be able to help you.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
I kept a copy of the data files in case it is needed, but I have to check first if I am allowed to give away that information. Some of the data is confidential. If you just need the files containing the damaged table, this won't be a big problem, because it does not contain any confidential information (as long as one data file only contains the data of one table). The other problem is the size of the files. The whole database is about 60GB and the files belonging to that table are about 2.5GB. Maybe there is a way to pre-select the data you need?

Decibel! wrote:
> Actually, this does sound like a hardware problem to me. You only have 5
> tables that get hit heavily, so you've likely got somewhere around a 20%
> chance that corruption would hit the same table on two different machines.
>
> So far you haven't said anything that sounds unusual about how you're
> using the database, and the hardware certainly seems pretty
> common-place, so I'm rather doubtful that it's software. But if you
> still have copies of the bad database, someone might be able to help you.

--
Marc Schablewski
click:ware Informationstechnik GmbH
"Marc Schablewski" <ms@clickware.de> writes: > I kept a copy of the data files in case it is needed, but I have to > check first, if I am allowed to give away that information. Some of the > data is confidential. If you just need the files containing the dammaged > table, this won't be a big problem, because it does not contain any > confidential information (as long as one data file only contains the > data of one table). The other problem is the size of the files. The > whole database is about 60GB and the files belonging to that table are > about 2.5GB. Mayby there is a way to pre-select the data you need? Perhaps. You could find the records with unreasonable values. But I don't think there's any convenient way to find the records which produce the clog error or which are missing unless they turn out to be on the same page. Out of curiosity, what do the unreasonable values look like? Earlier you said: > We narrowed it down to a few records in that table. Some records contain > unreasonable values, others produce the same message about the missing > pg_clog file when selected and some are simply missing. But they must have > existed, because there are still records in a second table referencing them. If you still have a live database with this data then if you can do SELECT ctid FROM tab WHERE ... for the records with unreasonable values that might tell you what blocks are corrupted. The value before the comma is the block number, which when multiplied by 8192 (assuming you're using 8k blocks) will tell you what file offset to look for the page. To find the file to look for the block in do: postgres=# select relfilenode from pg_class where relname = 'tablename'; relfilenode ------------- 16384 (1 row) Note that if the file offset is over 1G then you would be looking for a file named 16384.N where N is which gigabyte chunk. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
We also have the same exact problem - every 5 to 10 days, when the data get to some size, PostgreSQL complains about missing pg_clog files and invalid page headers during either vacuum or reindex operations.

The problem happens on different customer sites with Linux 2.6.11.

There is one particular table that is much more heavily used - 99% inserts, some selects. And this table turns out to be the one having problems.

If this happens at different companies so frequently, I'd doubt it's a hardware problem. And we did check the hardware but did not find any problems.

Here are some sample messages:

2007-07-07T10:11:38+00:00 ERROR: could not access status of transaction 842085945
2007-07-07T10:11:38+00:00 DETAIL: could not open file "pg_clog/0323": No such file or directory
2007-08-07T08:10:23+00:00 ERROR: could not access status of transaction 1481866610
2007-08-07T08:10:24+00:00 DETAIL: could not open file "pg_clog/0585": No such file or directory
2007-08-07T08:13:55+00:00 ERROR: invalid page header in block 346965 of relation "sipmessage"
2007-08-07T08:30:16+00:00 ERROR: could not access status of transaction 1481866610
2007-08-07T08:30:16+00:00 DETAIL: could not open file "pg_clog/0585": No such file or directory
2007-08-07T08:34:08+00:00 ERROR: invalid page header in block 346965 of relation "sipmessage"
2007-08-07T08:51:02+00:00 ERROR: could not access status of transaction 1481866610
2007-08-07T08:51:02+00:00 DETAIL: could not open file "pg_clog/0585": No such file or directory
2007-08-13T10:12:07+00:00 ERROR: invalid page header in block 4018 of relation "calllegstart_sessionid"
2007-08-13T10:12:15+00:00 ERROR: could not access status of transaction 0
2007-08-13T10:12:15+00:00 DETAIL: could not create file "pg_subtrans/0201": File exists

-----Original Message-----
From: Gregory Stark
Sent: Tuesday, July 31, 2007 7:00 AM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
On Fri, Aug 24, 2007 at 09:19:49AM -0400, Feng Chen wrote:
> The problem happens on different customer sites with Linux 2.6.11.

Well, at kernel.org in the changelog for 2.6.12, I see this:

    [PATCH] PCI: don't override drv->shutdown unconditionally

    There are many drivers that have been setting the generic driver
    model level shutdown callback, and pci thus must not override it.
    Without this patch we can have really bad data loss on various
    raid controllers.

Could it be related to your problem? (I'm not trying to dismiss -- I've just been bitten by too many Linux corner cases not to worry about the kernel.)

A

--
Andrew Sullivan                         | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of
saying November. --H.W. Fowler
Feng Chen wrote:
> We also have the same exact problem - every 5 to 10 days, when the data
> get to some size, PostgreSQL complains about missing pg_clog files and
> invalid page headers during either vacuum or reindex operations.
>
> The problem happens on different customer sites with Linux 2.6.11.
>
> There is one particular table that is much more heavily used - 99%
> inserts, some selects. And this table turns out to be the one having
> problems.
>
> If this happens at different companies so frequently, I'd doubt it's a
> hardware problem. And we did check the hardware but did not find any
> problems.
> [...]
> 2007-08-13T10:12:07+00:00 ERROR: invalid page header in block 4018 of
> relation "calllegstart_sessionid"
> 2007-08-13T10:12:15+00:00 ERROR: could not access status of transaction 0
> 2007-08-13T10:12:15+00:00 DETAIL: could not create file
> "pg_subtrans/0201": File exists

What version of PostgreSQL is this, exactly? There is a problem in older 8.1 versions that could cause this error under high transaction rates.

Stefan
Feng Chen wrote:
> We also have the same exact problem - every 5 to 10 days, when the data
> get to some size, PostgreSQL complains about missing pg_clog files and
> invalid page headers during either vacuum or reindex operations.

What PG version is this?

> 2007-08-13T10:12:15+00:00 ERROR: could not access status of transaction 0
> 2007-08-13T10:12:15+00:00 DETAIL: could not create file
> "pg_subtrans/0201": File exists

Huh???

--
Alvaro Herrera                        Developer, http://www.PostgreSQL.org/
"Aprende a avergonzarte más ante ti que ante los demás" (Demócrito)
VERSION = PostgreSQL 8.1.2

-----Original Message-----
From: Alvaro Herrera
Sent: Friday, August 24, 2007 10:02 AM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> What PG version is this?
> [...]
On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> VERSION = PostgreSQL 8.1.2

You know that the project doesn't put out maintenance releases for the fun of it, right? The latest is 8.1.9 in that series. You need to upgrade.

A

--
Andrew Sullivan                         | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack. --Scott Morris
Feng Chen wrote:
> VERSION = PostgreSQL 8.1.2

This could be http://archives.postgresql.org/pgsql-committers/2006-01/msg00287.php, which is fixed in 8.1.3 and later - so you really should look into upgrading to 8.1.9 as soon as possible...

Stefan
On Fri, Aug 24, 2007 at 10:11:06AM -0400, Andrew Sullivan wrote:
> On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> > VERSION = PostgreSQL 8.1.2
>
> You know that the project doesn't put out maintenance releases for
> the fun of it, right? The latest is 8.1.9 in that series. You need
> to upgrade.

Indeed, looking at the release notes:

# Fix race condition that could lead to "file already exists" errors
# during pg_clog and pg_subtrans file creation (Tom)

--
Andrew Sullivan                         | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath." --Damien Katz
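Since the thread keeps coming back to running the latest 8.1.x: a minor release normally keeps the same on-disk format, so no dump/reload is needed. A rough sketch of such an update, assuming a from-source build like the one described earlier; the paths, prefix, and shutdown mode are examples and will differ per system:

    # Hedged sketch of a minor upgrade within the 8.1 series (e.g. 8.1.2 -> 8.1.9).
    # Only the binaries are replaced; the existing data directory is reused.
    pg_ctl stop -D /var/lib/pgsql/data -m fast     # stop the old server
    cd postgresql-8.1.9
    ./configure && make && make install            # install new binaries over the old prefix (as root)
    pg_ctl start -D /var/lib/pgsql/data            # restart on the existing data directory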
Thanks for the replies! I guess it's time for 8.1.9 then. Will keep you informed afterwards!

-----Original Message-----
From: Andrew Sullivan
Sent: Friday, August 24, 2007 10:14 AM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Feng Chen wrote:
>> VERSION = PostgreSQL 8.1.2
> this could be
> http://archives.postgresql.org/pgsql-committers/2006-01/msg00287.php

Note, however, that that only explains the "could not create ... File exists" complaint, which is hardly the main problem here. The combination of "invalid page header" and "could not access status of transaction" messages looks like nothing so much as severe data corruption.

Remember that the xmin/xmax fields are basically the first thing we can check with any degree of strictness when examining a tuple. This means that if a page is partially clobbered, but not in a way that sets off the invalid-page-header checks, then the odds are very high that the first detectable sign of trouble will be references to transaction numbers that are far away from what the system is really using. (Is 1481866610 anywhere near the current XID counter reported by pg_controldata?)

I concur with the upthread comment to check into kernel bugs, particularly if all of your machines are using the same old kernel release.

			regards, tom lane
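A minimal shell sketch of that check, assuming the data directory is /var/lib/pgsql/data and using one of the XIDs from Feng's logs as the example value:

    # Compare a suspect XID from the error messages against the cluster's
    # current transaction counter. An XID far beyond NextXID (or far older
    # than the freeze horizon) points to a clobbered tuple header rather
    # than a genuinely missing clog segment.
    SUSPECT_XID=1481866610
    pg_controldata /var/lib/pgsql/data | grep "NextXID"
    echo "suspect XID: $SUSPECT_XID"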
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Remember that the xmin/xmax fields are basically the first thing we can > check with any degree of strictness when examining a tuple. This means that > if a page is partially clobbered, but not in a way that sets off the > invalid-page-header checks, then the odds are very high that the first > detectable sign of trouble will be references to transaction numbers that > are far away from what the system is really using. I'm increasingly thinking that one of the first things I'll suggest putting into 8.4 is a per-page checksum after all. It was talked about a while back and people thought it was pointless but I think the number of reports of hardware and kernel bugs resulting in zeroed and corrupted pages has been steadily going up. If not in total than as a percentage of the total problems. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Yes, most of the time PostgreSQL only complains about missing pg_clog/ files, then complains about invalid page headers. The "could not create ... File exists" message was only seen a couple of times.

I don't have a system that exhibits this problem right now. I can check the XID counter when it happens again.

Thanks!

-----Original Message-----
From: Tom Lane
Sent: Friday, August 24, 2007 12:11 PM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
Gregory Stark <stark@enterprisedb.com> writes:
> I'm increasingly thinking that one of the first things I'll suggest putting
> into 8.4 is a per-page checksum after all. It was talked about a while back
> and people thought it was pointless but I think the number of reports of
> hardware and kernel bugs resulting in zeroed and corrupted pages has been
> steadily going up. If not in total then as a percentage of the total problems.

It's still pointless; a checksum does nothing to prevent data corruption. The error report might be slightly more obvious to a novice but it doesn't bring your data back.

Something we could possibly do now is to modify these error messages: if the transaction number we're trying to check is obviously bogus (beyond the current XID counter or older than the current freeze horizon) we could report it as a corrupted XID rather than exposing the "no such clog segment" condition.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > It's still pointless; a checksum does nothing to prevent data > corruption. The error report might be slightly more obvious to a novice > but it doesn't bring your data back. Well if it's a single bit error from bad memory or a torn page from having full_page_writes turned off then the resulting page could be entirely valid. The user might not find out about the corrupt data before it's had time to migrate elsewhere. Also, the sooner the corrupt data is reported the sooner the user can restore from backups and avoid further data loss. The last discussion of this feature concentrated on beingg able to detect torn page corruption with full_page_writes turned off during recovery. > Something we could possibly do now is to modify these error messages: > if the transaction number we're trying to check is obviously bogus > (beyond the current XID counter or older than the current freeze > horizon) we could report it as a corrupted XID rather than exposing > the "no such clog segment" condition. That would be clever. I take it you mean the invalid values would be those values older than the actual relfrozenxid. Certainly they should throw some sort of error instead of trying to find the transaction in the clog. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Andrew Sullivan wrote:
> On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> > VERSION = PostgreSQL 8.1.2
>
> You know that the project doesn't put out maintenance releases for
> the fun of it, right? The latest is 8.1.9 in that series. You need
> to upgrade.

We moved the "you should upgrade for all minor releases" from FAQ text to a link to our web site:

    http://www.postgresql.org/support/versioning

The problem is that instead of those words being in the FAQ, they are now linked from the FAQ, and I am concerned that fewer people are seeing that recommendation.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
We've got the invalid page header again:

ERROR: invalid page header in block 18223 of relation "calllegstart"

bash-3.00# pg_controldata /var/lib/pgsql/data
pg_control version number:            812
Catalog version number:               200510211
Database system identifier:           5101280766299435989
Database cluster state:               in production
pg_control last modified:             Mon Aug 27 17:32:12 2007
Current log file ID:                  6
Next log file segment:                242
Latest checkpoint location:           6/EFB4C658
Prior checkpoint location:            6/EDC84A9C
Latest checkpoint's REDO location:    6/EFB04720
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          23661755
Latest checkpoint's NextOID:          24576
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Mon Aug 27 17:29:17 2007
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

-----Original Message-----
From: Feng Chen
Sent: Friday, August 24, 2007 1:42 PM
Subject: RE: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
On Fri, Aug 24, 2007 at 05:56:29PM -0400, Bruce Momjian wrote:
> We moved the "you should upgrade for all minor releases" from FAQ text
> to a link to our web site:
>
>     http://www.postgresql.org/support/versioning
>
> The problem is that instead of those words being in the FAQ, they are
> now linked from the FAQ, and I am concerned that fewer people are seeing
> that recommendation.

Could be, but I sort of doubt that's the problem. I'd be surprised if people who have this problem are reading the FAQ. But in any case, the Q that is listed there doesn't correspond to the linked text: the question asks about the upgrade procedure, and the versioning link talks about (initially) which version you should be running. I'd break that into two Qs: "What's the upgrade procedure?" (link to docs) and "What version should I use?" (link to versioning policy).

A

--
Andrew Sullivan                         | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism. --Brad Holland
On Fri, Aug 24, 2007 at 05:56:29PM -0400, Bruce Momjian wrote:
> Andrew Sullivan wrote:
> > On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> > > VERSION = PostgreSQL 8.1.2
> >
> > You know that the project doesn't put out maintenance releases for
> > the fun of it, right? The latest is 8.1.9 in that series. You need
> > to upgrade.
>
> We moved the "you should upgrade for all minor releases" from FAQ text
> to a link to our web site:
>
>     http://www.postgresql.org/support/versioning
>
> The problem is that instead of those words being in the FAQ, they are
> now linked from the FAQ, and I am concerned that fewer people are seeing
> that recommendation.

I think the backend should spew WARNINGs with increasing frequency if it's more than 2 point releases old. And it should refuse to start if it's more than 5.</sarcasm>

Seriously, I doubt anything short of that will make a big difference, but certainly putting the actual verbiage back in the FAQ can't hurt.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
I've a question that's somehow related to this bug. I hope it's OK to post it here, even if it's not a bug report.

We are planning to set up a standby system, in case our production database system crashes again. Replication by WAL archiving is one possible solution. But the question is: would an error like the one we had appear in the WAL and be replicated too? Or is there some kind of consistency check that prevents broken WAL from being restored?

Marc Schablewski
click:ware Informationstechnik GmbH

Decibel! wrote:
> [...]