Re: BUG #3484: Missing pg_clog file / corrupt index - Mailing list pgsql-bugs

From: Decibel!
Subject: Re: BUG #3484: Missing pg_clog file / corrupt index
Date:
Msg-id: AF7FB1C8-1457-41D4-9419-D9068470FCB9@decibel.org
In response to: BUG #3484: Missing pg_clog file / corrupt index ("Marc Schablewski" <ms@clickware.de>)
Responses: Re: BUG #3484: Missing pg_clog file / corrupt index
           Re: BUG #3484: Missing pg_clog file / corrupt index
List: pgsql-bugs
On Jul 25, 2007, at 4:02 PM, Marc Schablewski wrote:
> The following bug has been logged online:
>
> Bug reference:      3484
> Logged by:          Marc Schablewski
> Email address:      ms@clickware.de
> PostgreSQL version: 8.1.8
> Operating system:   SuSE Linux 10.0 / Kernel 2.6.13-15.8-smp (x86-64)
> Description:        Missing pg_clog file / corrupt index
> Details:
>
> Our application receives and processes payment information that comes
> in plain text files and stores the processed data in different tables
> in our database. There are about 5 tables involved, three of them with
> 35 million records so far. We get approximately 150000 payments a day.
> Each payment is handled in a single transaction, because in case of an
> error we want to store as many payments as possible. We have about
> 500000 INSERT and a few UPDATE statements each day. The whole
> application runs on two servers (see specs below) which are nearly
> identical. One is the production system, the other is for testing.
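
Illustrative sketch only (the real schema isn't shown in the report; the
payment table and columns below are hypothetical): the per-payment pattern
amounts to wrapping each payment's statements in their own transaction, so
an error discards only that payment.

    BEGIN;
    INSERT INTO payment (payment_id, amount, received_at)
         VALUES (12345, 19.99, now());
    -- further INSERTs/UPDATEs belonging to the same payment
    COMMIT;  -- a failure before this point rolls back only this payment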
>
> A few months ago we had some trouble with the test system. Postgres
> complained about a missing pg_clog file during the nightly routine
> VACUUM/ANALYZE. Some days later, the same error occurred on the
> production system, even on the same table! The corrupted table is one
> of the bigger ones involved in the file processing. After searching the
> web we found a hint that this problem could be related to a bug in
> 8.1.3, so we upgraded to 8.1.8 and restored the databases on both
> servers. This was around May, but now we have discovered the same
> problem on our production system again. Actually, the error occurred
> four weeks ago, but it was not discovered until now - if you do error
> logging, you should look at it from time to time ;)
>
> When trying to back up or vacuum the database, we receive one of the
> following error messages:
>
> **BACKUP**
> pg_dump: ERROR:  could not access status of transaction 2134240
> DETAIL:  could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
> pg_dump: SQL command to dump the contents of table "dateiblock" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  could not access status of
> transaction 2134240
> DETAIL:  could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
> pg_dump: The command was: COPY public.dateiblock (id, eda_id, dbt_id,
> lfdnr, binaer_offset, laenge) TO stdout;
>
> **VACUUM**
> INFO:  vacuuming "public.dateiblock"
> ERROR:  could not access status of transaction 2134240
> DETAIL:  could not open file "pg_clog/0002": Datei oder Verzeichnis nicht gefunden
>
> (For those not familiar with the German language: 'Datei oder
> Verzeichnis nicht gefunden' means 'file or directory not found'.)
>
> Current pg_clog files range from 005A to 008F. The oldest one is dated
> April 30th.
>
> We narrowed it down to a few records in that table. Some records
> contain unreasonable values, others produce the same message about the
> missing pg_clog file when selected, and some are simply missing. But
> they must have existed, because there are still records in a second
> table referencing them. One strange thing about this is that the
> referencing records are about two and a half months old and shouldn't
> have been touched since then.
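
For illustration only, one rough way to isolate such rows is to probe the
table in primary-key ranges (ranges that raise the pg_clog error contain
damaged tuples) and to list referencing rows whose parent row is gone; the
join below assumes transaktion.dbl_id references dateiblock.id, which the
report does not spell out.

    -- probe in id ranges; repeat with increasing ranges
    SELECT count(*) FROM dateiblock WHERE id BETWEEN 0 AND 99999;

    -- referencing rows whose dateiblock parent no longer exists
    -- (assumes transaktion.dbl_id references dateiblock.id)
    SELECT t.dbl_id
      FROM transaktion t
      LEFT JOIN dateiblock d ON d.id = t.dbl_id
     WHERE t.dbl_id IS NOT NULL
       AND d.id IS NULL;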
>
> We don't think this is a hardware issue, because we had it on two
> different servers and within a short period of time.
>
> Luckily, the loss of data is minimal. There are only about 30 records
> affected. Otherwise this would have been fatal because, as said before,
> our backup was not working either.
>
>
> In addition, we had another problem today. One of the indexes on a
> second table became corrupted, causing the database backend to roll
> back all transactions and restart the processes. A REINDEX fixed it,
> but it leaves me with a bad feeling about what will break next.
>
> 2007-07-25 08:07:00 CEST PANIC:  right sibling's left-link doesn't match
> 2007-07-25 08:07:00 CEST STATEMENT:  insert into transaktion (kaz_id,
> dbl_id, sta_id, kss_id, summe, zeitpunkt, tracenr, terminalid, status,
> kartennr, wae_id, kassenschnitt, freigabe, flo_id, importdatum)  VALUES
> ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15)
> 2007-07-25 08:07:00 CEST LOG:  server process (PID 5699) was terminated
> by signal 6
> 2007-07-25 08:07:00 CEST LOG:  terminating any other active server
> processes
> 2007-07-25 08:07:00 CEST WARNING:  terminating connection because of
> crash of another server process
> 2007-07-25 08:07:00 CEST DETAIL:  The postmaster has commanded this
> server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
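
For reference, the rebuild mentioned above can target the whole table when
the damaged index isn't known; the statement in the log points at
transaktion, but this is only an illustration, not the command actually run.

    REINDEX TABLE transaktion;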
>
>
> Kind regards,
>
> Marc Schablewski
>
>
>
> System:
>
> OS:    SUSE LINUX 10.0 (x86-64), 2.6.13-15.8-smp x86_64
>
> System:    2x Intel(R) Xeon(TM) CPU 2.80GHz Dual Core, 4GB RAM (HP
> ProLiant server)
> 3Ware 9500S-4LP, 2x RAID1 (one for OS & one for database)
>
> Postgres 8.1.8
>
> Postgres was compiled manually but with default parameters. The
> configuration has been tuned to improve performance.
>
> shared_buffers = 30000
> maintenance_work_mem = 131072
> max_fsm_pages = 500000
> default_statistics_target = 200
> redirect_stderr = on
> log_directory = '/var/log/pg_log'
>
> log_min_duration_statement = 10
> log_duration = on
> log_line_prefix = '%t '
> stats_start_collector = on
> stats_command_string = on
> stats_block_level = on
> stats_row_level = on
> stats_reset_on_server_start = off
>
> VACUUM ANALYZE runs once a day, auto_vacuum is off.

Actually, this does sound like a hardware problem to me. You only have
5 tables that get hit heavily, so you've likely got somewhere around a
20% chance that corruption would hit the same table on two different
machines.
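
(Back-of-the-envelope: whichever of the 5 heavily used tables the first
machine's corruption hits, an independent corruption on the second machine
lands on that same table with probability 1/5 = 20%.)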

So far you haven't said anything that sounds unusual about how you're
using the database, and the hardware certainly seems pretty commonplace,
so I'm rather doubtful that it's software. But if you still have copies
of the bad database, someone might be able to help you.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
