Re: BUG #3484: Missing pg_clog file / corrupt index - Mailing list pgsql-bugs
From | Decibel! |
---|---|
Subject | Re: BUG #3484: Missing pg_clog file / corrupt index |
Date | |
Msg-id | AF7FB1C8-1457-41D4-9419-D9068470FCB9@decibel.org |
In response to | BUG #3484: Missing pg_clog file / corrupt index ("Marc Schablewski" <ms@clickware.de>) |
Responses | Re: BUG #3484: Missing pg_clog file / corrupt index; Re: BUG #3484: Missing pg_clog file / corrupt index |
List | pgsql-bugs |
On Jul 25, 2007, at 4:02 PM, Marc Schablewski wrote:

> The following bug has been logged online:
>
> Bug reference: 3484
> Logged by: Marc Schablewski
> Email address: ms@clickware.de
> PostgreSQL version: 8.1.8
> Operating system: SuSE Linux 10.0 / Kernel 2.6.13-15.8-smp (x86-64)
> Description: Missing pg_clog file / corrupt index
> Details:
>
> Our application receives and processes payment information that comes in
> plain text files and stores the processed data in different tables in our
> database. There are about 5 tables involved, three of them with 35 million
> records so far. We get approximately 150000 payments a day. Each payment
> is handled in a single transaction, because in case of an error we want to
> store as many payments as possible. We have about 500000 INSERT and a few
> UPDATE statements each day. The whole application runs on two servers (see
> specs below) which are nearly identical. One is the production system, the
> other is for testing.
>
> A few months ago we had some trouble with the test system. Postgres
> complained about a missing pg_clog file during the nightly routine
> VACUUM/ANALYZE. Some days later, the same error occurred on the production
> system, even on the same table! The corrupted table is one of the bigger
> ones involved in the file processing. After searching the web we found a
> hint that this problem could be related to a bug in 8.1.3, so we upgraded
> to 8.1.8 and restored the databases on both servers. This was around May,
> but now we have discovered the same problem on our production system
> again. Actually, the error occurred four weeks ago, but it was not
> discovered until now – if you do error logging, you should look at it from
> time to time ;)
>
> When trying to back up or vacuum the database, we receive one of the
> following error messages:
>
> **BACKUP**
> pg_dump: ERROR: could not access status of transaction 2134240
> DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht
> gefunden
> pg_dump: SQL command to dump the contents of table "dateiblock" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR: could not access status of
> transaction 2134240
> DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht
> gefunden
> pg_dump: The command was: COPY public.dateiblock (id, eda_id, dbt_id,
> lfdnr, binaer_offset, laenge) TO stdout;
>
> **VACUUM**
> INFO: vacuuming "public.dateiblock"
> ERROR: could not access status of transaction 2134240
> DETAIL: could not open file "pg_clog/0002": Datei oder Verzeichnis nicht
> gefunden
>
> (For those not familiar with the German language: 'Datei oder Verzeichnis
> nicht gefunden' means 'file or directory not found'.)
>
> The current pg_clog files range from 005A to 008F. The oldest one is dated
> April 30th.
>
> We narrowed it down to a few records in that table. Some records contain
> unreasonable values, others produce the same message about the missing
> pg_clog file when selected, and some are simply missing. But they must
> have existed, because there are still records in a second table
> referencing them. One strange thing about this is that the referencing
> records are about two and a half months old and shouldn't have been
> touched since then.
> We don't think this is a hardware issue, because we had it on two
> different servers and within a short period of time.
>
> Luckily, the loss of data is minimal. There are only about 30 records
> affected. Otherwise this would have been fatal, because, as said before,
> our backup was not working either.
>
> In addition, we had another problem today. One of the indexes on a second
> table became corrupted, causing the database backend to roll back all
> transactions and restart the processes. A REINDEX fixed it, but it leaves
> me with a bad feeling about what will break next.
>
> 2007-07-25 08:07:00 CEST PANIC: right sibling's left-link doesn't match
> 2007-07-25 08:07:00 CEST STATEMENT: insert into transaktion (kaz_id,
> dbl_id, sta_id, kss_id, summe, zeitpunkt, tracenr, terminalid, status,
> kartennr, wae_id, kassenschnitt, freigabe, flo_id, importdatum) VALUES
> ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15)
> 2007-07-25 08:07:00 CEST LOG: server process (PID 5699) was terminated by
> signal 6
> 2007-07-25 08:07:00 CEST LOG: terminating any other active server
> processes
> 2007-07-25 08:07:00 CEST WARNING: terminating connection because of crash
> of another server process
> 2007-07-25 08:07:00 CEST DETAIL: The postmaster has commanded this server
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
>
> Kind regards,
>
> Marc Schablewski
>
> System:
>
> OS: SUSE LINUX 10.0 (x86-64), 2.6.13-15.8-smp x86_64
> System: 2x Intel(R) Xeon(TM) CPU 2.80GHz Dual Core, 4GB RAM (HP ProLiant
> server)
> 3Ware 9500S-4LP, 2x RAID1 (one for OS & one for database)
>
> Postgres 8.1.8
>
> Postgres was compiled manually but with default parameters. The
> configuration has been tuned to improve performance.
>
> shared_buffers = 30000
> maintenance_work_mem = 131072
> max_fsm_pages = 500000
> default_statistics_target = 200
> redirect_stderr = on
> log_directory = '/var/log/pg_log'
>
> log_min_duration_statement = 10
> log_duration = on
> log_line_prefix = '%t '
> stats_start_collector = on
> stats_command_string = on
> stats_block_level = on
> stats_row_level = on
> stats_reset_on_server_start = off
>
> VACUUM ANALYZE runs once a day, autovacuum is off.

Actually, this does sound like a hardware problem to me. You only have 5
tables that get hit heavily, so you've likely got somewhere around a 20%
chance that corruption would hit the same table on two different machines.

So far you haven't said anything that sounds unusual about how you're using
the database, and the hardware certainly seems pretty commonplace, so I'm
rather doubtful that it's software. But if you still have copies of the bad
database, someone might be able to help you.

--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)
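For anyone who finds this thread with the same "could not open file
"pg_clog/0002"" failure, below is a minimal sketch of the zero-fill
workaround regularly mentioned on these lists. It is not a repair; it only
lets a dump get past the error. It assumes a default 8.1 build, where each
clog segment is 256 kB (32 pages of 8 kB) and covers 1,048,576 transaction
IDs, which is why transaction 2134240 maps to segment 0002. The
data-directory paths, port, and database name are placeholders, and the
final query guesses the foreign key (transaktion.dbl_id -> dateiblock.id)
from the INSERT statement in the log above. Work only on a file-level copy
of the stopped cluster, never on the original.

    # stop the server and work on a file-level copy, never the original
    pg_ctl -D /path/to/data stop
    cp -a /path/to/data /path/to/data-copy

    # recreate the missing clog segment as 256 kB of zero bytes; zeroed
    # status bits make the affected transactions look as if they never
    # committed, so their rows silently disappear instead of raising
    # "could not access status of transaction"
    dd if=/dev/zero of=/path/to/data-copy/pg_clog/0002 bs=8192 count=32

    # start the copy on a spare port; pg_dump should now get past the
    # error, and the anti-join lists rows in the referencing table whose
    # dateiblock row has vanished
    pg_ctl -D /path/to/data-copy -o "-p 5433" start
    psql -p 5433 -d yourdb -c "SELECT t.* FROM transaktion t LEFT JOIN dateiblock d ON d.id = t.dbl_id WHERE d.id IS NULL;"

Rows hidden this way are effectively treated as aborted, so the roughly 30
affected records would still have to be restored from the original payment
files; the point of the exercise is only to make the surviving data
dumpable again.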