Thread: BUG #3484: Missing pg_clog file / corrupt index
The following bug has been logged online:

Bug reference:      3484
Logged by:          Marc Schablewski
Email address:      ms@clickware.de
PostgreSQL version: 8.1.8
Operating system:   SuSE Linux 10.0 / Kernel 2.6.13-15.8-smp (x86-64)
Description:        Missing pg_clog file / corrupt index
Details:

Our application receives and processes payment information that comes in plain text files and stores the processed data into different tables in our database. There are about 5 tables involved, three of them with 35 million records so far. We get approximately 150000 payments a day. Each payment is handled in a single transaction, because in case of an error we want to store as many payments as possible. We have about 500000 INSERT and a few UPDATE statements each day. The whole application runs on two servers (see specs below) which are nearly identical. One is the production system, the other is for testing.

A few months ago we had some trouble with the test system. Postgres complained about a missing pg_clog file during the nightly routine VACUUM/ANALYZE. Some days later, the same error occurred on the production system, even on the same table! The corrupted table is one of the bigger ones involved in the file processing. After searching the web we found a hint that this problem could be related to a bug in 8.1.3, so we upgraded to 8.1.8 and restored the databases on both servers. This was around May, but now we have discovered the same problem on our production system again. Actually, the error occurred four weeks ago, but it was not discovered until now – if you do error logging, you should look at it from time to time ;)

When trying to back up or vacuum the database, we receive one of the following error messages:

**BACKUP**
pg_dump: ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
pg_dump: SQL command to dump the contents of table "dateiblock" failed: PQendcopy() failed.
pg_dump: Error message from server: ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
pg_dump: The command was: COPY public.dateiblock (id, eda_id, dbt_id, lfdnr, binaer_offset, laenge) TO stdout;

**VACUUM**
INFO: vacuuming "public.dateiblock"
ERROR: could not access status of transaction 2134240
DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden

(For those not familiar with the German language: 'Datei oder Verzeichnis nicht gefunden' means 'file or directory not found'.)

Current pg_clogs range from 005A to 008F. The oldest one is dated April 30th.

We narrowed it down to a few records in that table. Some records contain unreasonable values, others produce the same message about the missing pg_clog file when selected, and some are simply missing. But they must have existed, because there are still records in a second table referencing them. One strange thing about this is that the referencing records are about two and a half months old and shouldn't have been touched since then.

We don't think this is a hardware issue, because we had it on two different servers and within a short period of time.

Luckily, the loss of data is minimal. There are only about 30 records affected. Otherwise this would have been fatal, because, as said before, our backup was not working either.

In addition, we had another problem today. One of the indexes on a second table became corrupted, causing the database backend to roll back all transactions and restart the processes.
A REINDEX fixed it, but it leaves me with a bad feeling about what will break next.

2007-07-25 08:07:00 CEST PANIC: right sibling's left-link doesn't match
2007-07-25 08:07:00 CEST STATEMENT: insert into transaktion (kaz_id, dbl_id, sta_id, kss_id, summe, zeitpunkt, tracenr, terminalid, status, kartennr, wae_id, kassenschnitt, freigabe, flo_id, importdatum) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15)
2007-07-25 08:07:00 CEST LOG: server process (PID 5699) was terminated by signal 6
2007-07-25 08:07:00 CEST LOG: terminating any other active server processes
2007-07-25 08:07:00 CEST WARNING: terminating connection because of crash of another server process
2007-07-25 08:07:00 CEST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Kind regards,

Marc Schablewski

System:

OS: SUSE LINUX 10.0 (x86-64), 2.6.13-15.8-smp x86_64
System: 2x Intel(R) Xeon(TM) CPU 2.80GHz Dual Core, 4GB RAM (HP ProLiant server)
3Ware 9500S-4LP, 2x RAID1 (one for OS & one for database)

Postgres 8.1.8

Postgres was compiled manually but with default parameters. The configuration has been tuned to improve performance.

shared_buffers = 30000
maintenance_work_mem = 131072
max_fsm_pages = 500000
default_statistics_target = 200
redirect_stderr = on
log_directory = '/var/log/pg_log'
log_min_duration_statement = 10
log_duration = on
log_line_prefix = '%t '
stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = off

VACUUM ANALYZE runs once a day, auto_vacuum is off.
On Jul 25, 2007, at 4:02 PM, Marc Schablewski wrote:
> [...]

Actually, this does sound like a hardware problem to me. You only have 5 tables that get hit heavily, so you've likely got somewhere around a 20% chance that corruption would hit the same table on two different machines.

So far you haven't said anything that sounds unusual about how you're using the database, and the hardware certainly seems pretty common-place, so I'm rather doubtful that it's software. But if you still have copies of the bad database, someone might be able to help you.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
I kept a copy of the data files in case it is needed, but I have to check first if I am allowed to give away that information. Some of the data is confidential. If you just need the files containing the damaged table, this won't be a big problem, because it does not contain any confidential information (as long as one data file only contains the data of one table). The other problem is the size of the files. The whole database is about 60GB and the files belonging to that table are about 2.5GB. Maybe there is a way to pre-select the data you need?

Decibel! wrote:
> Actually, this does sound like a hardware problem to me. You only have 5
> tables that get hit heavily, so you've likely got somewhere around a 20%
> chance that corruption would hit the same table on two different machines.
>
> So far you haven't said anything that sounds unusual about how you're
> using the database, and the hardware certainly seems pretty
> common-place, so I'm rather doubtful that it's software. But if you
> still have copies of the bad database, someone might be able to help you.

--
Marc Schablewski
click:ware Informationstechnik GmbH
"Marc Schablewski" <ms@clickware.de> writes: > I kept a copy of the data files in case it is needed, but I have to > check first, if I am allowed to give away that information. Some of the > data is confidential. If you just need the files containing the dammaged > table, this won't be a big problem, because it does not contain any > confidential information (as long as one data file only contains the > data of one table). The other problem is the size of the files. The > whole database is about 60GB and the files belonging to that table are > about 2.5GB. Mayby there is a way to pre-select the data you need? Perhaps. You could find the records with unreasonable values. But I don't think there's any convenient way to find the records which produce the clog error or which are missing unless they turn out to be on the same page. Out of curiosity, what do the unreasonable values look like? Earlier you said: > We narrowed it down to a few records in that table. Some records contain > unreasonable values, others produce the same message about the missing > pg_clog file when selected and some are simply missing. But they must have > existed, because there are still records in a second table referencing them. If you still have a live database with this data then if you can do SELECT ctid FROM tab WHERE ... for the records with unreasonable values that might tell you what blocks are corrupted. The value before the comma is the block number, which when multiplied by 8192 (assuming you're using 8k blocks) will tell you what file offset to look for the page. To find the file to look for the block in do: postgres=# select relfilenode from pg_class where relname = 'tablename'; relfilenode ------------- 16384 (1 row) Note that if the file offset is over 1G then you would be looking for a file named 16384.N where N is which gigabyte chunk. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
We also have the same exact problem - every 5 to 10 days, when the data get to some size, PostgreSQL complains about missing pg_clog files and invalid page headers during either vacuum or reindex operations.

The problem happens on different customer sites with Linux 2.6.11.

There is one particular table that is much more heavily used - 99% inserts, some selects. And this table turns out to be the one having problems.

If this happens at different companies so frequently, I'd doubt it's a hardware problem. And we did check the hardware but did not find any problems.

Here are some sample messages:

2007-07-07T10:11:38+00:00 ERROR: could not access status of transaction 842085945
2007-07-07T10:11:38+00:00 DETAIL: could not open file "pg_clog/0323": No such file or directory
2007-08-07T08:10:23+00:00 ERROR: could not access status of transaction 1481866610
2007-08-07T08:10:24+00:00 DETAIL: could not open file "pg_clog/0585": No such file or directory
2007-08-07T08:13:55+00:00 ERROR: invalid page header in block 346965 of relation "sipmessage"
2007-08-07T08:30:16+00:00 ERROR: could not access status of transaction 1481866610
2007-08-07T08:30:16+00:00 DETAIL: could not open file "pg_clog/0585": No such file or directory
2007-08-07T08:34:08+00:00 ERROR: invalid page header in block 346965 of relation "sipmessage"
2007-08-07T08:51:02+00:00 ERROR: could not access status of transaction 1481866610
2007-08-07T08:51:02+00:00 DETAIL: could not open file "pg_clog/0585": No such file or directory
2007-08-13T10:12:07+00:00 ERROR: invalid page header in block 4018 of relation "calllegstart_sessionid"
2007-08-13T10:12:15+00:00 ERROR: could not access status of transaction 0
2007-08-13T10:12:15+00:00 DETAIL: could not create file "pg_subtrans/0201": File exists

-----Original Message-----
From: Gregory Stark
Sent: Tuesday, July 31, 2007 7:00 AM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
On Fri, Aug 24, 2007 at 09:19:49AM -0400, Feng Chen wrote:
> The problem happens on different customer sites with Linux 2.6.11.

Well, at kernel.org in the changelog for 2.6.12, I see this:

    [PATCH] PCI: don't override drv->shutdown unconditionally

    There are many drivers that have been setting the generic driver
    model level shutdown callback, and pci thus must not override it.
    Without this patch we can have really bad data loss on various
    raid controllers.

Could it be related to your problem? (I'm not trying to dismiss -- I've just been bitten by too many Linux corner cases not to worry about the kernel.)

A

--
Andrew Sullivan                         | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of
saying November. --H.W. Fowler
Feng Chen wrote:
> We also have the same exact problem - every 5 to 10 days, when the data
> get to some size, PostgreSQL complains about missing pg_clog files and
> invalid page headers during either vacuum or reindex operations.
>
> The problem happens on different customer sites with Linux 2.6.11.
>
> There is one particular table that is much more heavily used - 99%
> inserts, some selects. And this table turns out to be the one having
> problems.
>
> If this happens at different companies so frequently, I'd doubt it's a
> hardware problem. And we did check the hardware but did not find any
> problems.
> [...]
> 2007-08-13T10:12:07+00:00 ERROR: invalid page header in block 4018 of
> relation "calllegstart_sessionid"
> 2007-08-13T10:12:15+00:00 ERROR: could not access status of transaction 0
> 2007-08-13T10:12:15+00:00 DETAIL: could not create file
> "pg_subtrans/0201": File exists

What version of PostgreSQL is this, exactly? There is a problem in older 8.1 versions that could cause this error under high transaction rates.

Stefan
Feng Chen wrote:
> We also have the same exact problem - every 5 to 10 days, when the data
> get to some size, PostgreSQL complains about missing pg_clog files and
> invalid page headers during either vacuum or reindex operations.

What PG version is this?

> 2007-08-13T10:12:15+00:00 ERROR: could not access status of transaction 0
> 2007-08-13T10:12:15+00:00 DETAIL: could not create file
> "pg_subtrans/0201": File exists

Huh???

--
Alvaro Herrera                        Developer, http://www.PostgreSQL.org/
"Aprende a avergonzarte más ante ti que ante los demás" (Demócrito)
VERSION = PostgreSQL 8.1.2

-----Original Message-----
From: Alvaro Herrera
Sent: Friday, August 24, 2007 10:02 AM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> What PG version is this?
> [...]
On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> VERSION = PostgreSQL 8.1.2

You know that the project doesn't put out maintenance releases for the fun of it, right? The latest is 8.1.9 in that series. You need to upgrade.

A

--
Andrew Sullivan                         | ajs@crankycanuck.ca
Unfortunately reformatting the Internet is a little more painful
than reformatting your hard drive when it gets out of whack. --Scott Morris
Feng Chen wrote:
> VERSION = PostgreSQL 8.1.2

This could be http://archives.postgresql.org/pgsql-committers/2006-01/msg00287.php, which is fixed in 8.1.3 and later - so you really should look into upgrading to 8.1.9 as soon as possible...

Stefan
On Fri, Aug 24, 2007 at 10:11:06AM -0400, Andrew Sullivan wrote:
> On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> > VERSION = PostgreSQL 8.1.2
>
> You know that the project doesn't put out maintenance releases for
> the fun of it, right? The latest is 8.1.9 in that series. You need
> to upgrade.

Indeed, looking at the release notes:

# Fix race condition that could lead to "file already exists" errors
# during pg_clog and pg_subtrans file creation (Tom)

--
Andrew Sullivan                         | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath." --Damien Katz
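Since the thread keeps coming back to running the latest 8.1.x: a minor release normally keeps the same on-disk format, so no dump/reload is needed. A rough sketch of such an update, assuming a from-source build like the one described earlier; the paths, prefix, and shutdown mode are examples and will differ per system:

    # Hedged sketch of a minor upgrade within the 8.1 series (e.g. 8.1.2 -> 8.1.9).
    # Only the binaries are replaced; the existing data directory is reused.
    pg_ctl stop -D /var/lib/pgsql/data -m fast     # stop the old server
    cd postgresql-8.1.9
    ./configure && make && make install            # install new binaries over the old prefix (as root)
    pg_ctl start -D /var/lib/pgsql/data            # restart on the existing data directory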
Thanks for the replies! I guess it's time for 8.1.9 then. Will keep you informed afterwards!

-----Original Message-----
From: Andrew Sullivan
Sent: Friday, August 24, 2007 10:14 AM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Feng Chen wrote:
>> VERSION = PostgreSQL 8.1.2
> this could be
> http://archives.postgresql.org/pgsql-committers/2006-01/msg00287.php

Note, however, that that only explains the "could not create ... File exists" complaint, which is hardly the main problem here. The combination of "invalid page header" and "could not access status of transaction" messages looks like nothing so much as severe data corruption.

Remember that the xmin/xmax fields are basically the first thing we can check with any degree of strictness when examining a tuple. This means that if a page is partially clobbered, but not in a way that sets off the invalid-page-header checks, then the odds are very high that the first detectable sign of trouble will be references to transaction numbers that are far away from what the system is really using. (Is 1481866610 anywhere near the current XID counter reported by pg_controldata?)

I concur with the upthread comment to check into kernel bugs, particularly if all of your machines are using the same old kernel release.

			regards, tom lane
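A minimal shell sketch of that check, assuming the data directory is /var/lib/pgsql/data and using one of the XIDs from Feng's logs as the example value:

    # Compare a suspect XID from the error messages against the cluster's
    # current transaction counter. An XID far beyond NextXID (or far older
    # than the freeze horizon) points to a clobbered tuple header rather
    # than a genuinely missing clog segment.
    SUSPECT_XID=1481866610
    pg_controldata /var/lib/pgsql/data | grep "NextXID"
    echo "suspect XID: $SUSPECT_XID"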
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Remember that the xmin/xmax fields are basically the first thing we can > check with any degree of strictness when examining a tuple. This means that > if a page is partially clobbered, but not in a way that sets off the > invalid-page-header checks, then the odds are very high that the first > detectable sign of trouble will be references to transaction numbers that > are far away from what the system is really using. I'm increasingly thinking that one of the first things I'll suggest putting into 8.4 is a per-page checksum after all. It was talked about a while back and people thought it was pointless but I think the number of reports of hardware and kernel bugs resulting in zeroed and corrupted pages has been steadily going up. If not in total than as a percentage of the total problems. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Yes, most of the time PostgreSQL only complains about missing pg_clog/ files, then complains about invalid page headers. The "could not create ... File exists" message was only seen a couple of times.

I don't have a system that exhibits this problem right now. I can check the XID counter when it happens again.

Thanks!

-----Original Message-----
From: Tom Lane
Sent: Friday, August 24, 2007 12:11 PM
Subject: Re: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
Gregory Stark <stark@enterprisedb.com> writes:
> I'm increasingly thinking that one of the first things I'll suggest putting
> into 8.4 is a per-page checksum after all. It was talked about a while back
> and people thought it was pointless but I think the number of reports of
> hardware and kernel bugs resulting in zeroed and corrupted pages has been
> steadily going up. If not in total then as a percentage of the total problems.

It's still pointless; a checksum does nothing to prevent data corruption. The error report might be slightly more obvious to a novice but it doesn't bring your data back.

Something we could possibly do now is to modify these error messages: if the transaction number we're trying to check is obviously bogus (beyond the current XID counter or older than the current freeze horizon) we could report it as a corrupted XID rather than exposing the "no such clog segment" condition.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > It's still pointless; a checksum does nothing to prevent data > corruption. The error report might be slightly more obvious to a novice > but it doesn't bring your data back. Well if it's a single bit error from bad memory or a torn page from having full_page_writes turned off then the resulting page could be entirely valid. The user might not find out about the corrupt data before it's had time to migrate elsewhere. Also, the sooner the corrupt data is reported the sooner the user can restore from backups and avoid further data loss. The last discussion of this feature concentrated on beingg able to detect torn page corruption with full_page_writes turned off during recovery. > Something we could possibly do now is to modify these error messages: > if the transaction number we're trying to check is obviously bogus > (beyond the current XID counter or older than the current freeze > horizon) we could report it as a corrupted XID rather than exposing > the "no such clog segment" condition. That would be clever. I take it you mean the invalid values would be those values older than the actual relfrozenxid. Certainly they should throw some sort of error instead of trying to find the transaction in the clog. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Andrew Sullivan wrote:
> On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> > VERSION = PostgreSQL 8.1.2
>
> You know that the project doesn't put out maintenance releases for
> the fun of it, right? The latest is 8.1.9 in that series. You need
> to upgrade.

We moved the "you should upgrade for all minor releases" from FAQ text to a link to our web site:

    http://www.postgresql.org/support/versioning

The problem is that instead of those words being in the FAQ, they are now linked from the FAQ, and I am concerned that fewer people are seeing that recommendation.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
We've got the invalid page header again:

ERROR: invalid page header in block 18223 of relation "calllegstart"

bash-3.00# pg_controldata /var/lib/pgsql/data
pg_control version number:            812
Catalog version number:               200510211
Database system identifier:           5101280766299435989
Database cluster state:               in production
pg_control last modified:             Mon Aug 27 17:32:12 2007
Current log file ID:                  6
Next log file segment:                242
Latest checkpoint location:           6/EFB4C658
Prior checkpoint location:            6/EDC84A9C
Latest checkpoint's REDO location:    6/EFB04720
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          23661755
Latest checkpoint's NextOID:          24576
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Time of latest checkpoint:            Mon Aug 27 17:29:17 2007
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               floating-point numbers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

-----Original Message-----
From: Feng Chen
Sent: Friday, August 24, 2007 1:42 PM
Subject: RE: [BUGS] BUG #3484: Missing pg_clog file / corrupt index

> [...]
On Fri, Aug 24, 2007 at 05:56:29PM -0400, Bruce Momjian wrote:
> We moved the "you should upgrade for all minor releases" from FAQ text
> to a link to our web site:
>
>     http://www.postgresql.org/support/versioning
>
> The problem is that instead of those words being in the FAQ, they are
> now linked from the FAQ, and I am concerned that fewer people are seeing
> that recommendation.

Could be, but I sort of doubt that's the problem. I'd be surprised if people who have this problem are reading the FAQ. But in any case, the Q that is listed there doesn't correspond to the linked text: the question asks about the upgrade procedure, and the versioning link talks about (initially) which version you should be running. I'd break that into two Qs: "What's the upgrade procedure?" (link to docs) and "What version should I use?" (link to versioning policy).

A

--
Andrew Sullivan                         | ajs@crankycanuck.ca
In the future this spectacle of the middle classes shocking the avant-
garde will probably become the textbook definition of Postmodernism. --Brad Holland
On Fri, Aug 24, 2007 at 05:56:29PM -0400, Bruce Momjian wrote:
> Andrew Sullivan wrote:
> > On Fri, Aug 24, 2007 at 10:06:56AM -0400, Feng Chen wrote:
> > > VERSION = PostgreSQL 8.1.2
> >
> > You know that the project doesn't put out maintenance releases for
> > the fun of it, right? The latest is 8.1.9 in that series. You need
> > to upgrade.
>
> We moved the "you should upgrade for all minor releases" from FAQ text
> to a link to our web site:
>
>     http://www.postgresql.org/support/versioning
>
> The problem is that instead of those words being in the FAQ, they are
> now linked from the FAQ, and I am concerned that fewer people are seeing
> that recommendation.

I think the backend should spew WARNINGs with increasing frequency if it's more than 2 point releases old. And it should refuse to start if it's more than 5.</sarcasm>

Seriously, I doubt anything short of that will make a big difference, but certainly putting the actual verbiage back in the FAQ can't hurt.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
I've a question that's somehow related to this bug. I hope it's OK to post it here, even if it's not a bug report.

We are planning to set up a standby system, in case our production database system crashes again. Replication by WAL archiving is one possible solution. But the question is: would an error like the one we had appear in the WAL and be replicated too? Or is there some kind of consistency check that prevents broken WAL from being restored?

Marc Schablewski
click:ware Informationstechnik GmbH

Decibel! wrote:
> [...]