Thread: another pg database corruption?
Hi all,

I am using pg 7.3.4 on Linux (Red Hat 7.3) with ReiserFS. After about 30 days of error-free work, my pg stopped. In my log I found:

PANIC: open of /mnt/diske/skladdb/pg_clog/0D02 failed: No such file or directory
LOG: statement: COPY public.a_sp (ids, ids_ma, ids_mp, ids_mr, ids_mpr, ids_mpdt, ids_mpkt, smetkas, smetkaa, smetkad1, smetkad2, tip, vaji, ime, imeeng, balans, opr, pp, prihrazh_razh, prihrazh_prih, ids_firma) TO stdout;
LOG: server process (pid 1252) was terminated by signal 6
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and semaphores
LOG: database system was interrupted at 2003-10-27 14:34:01 EET
LOG: checkpoint record is at 2/1C591C4C
LOG: redo record is at 2/1C591C4C; undo record is at 0/0; shutdown TRUE
LOG: next transaction id: 493496; next oid: 197060600
LOG: database system was not properly shut down; automatic recovery in progress
LOG: ReadRecord: record with zero length at 2/1C591C8C
LOG: redo is not required
LOG: database system is ready
PANIC: open of /mnt/diske/skladdb/pg_clog/09FD failed: No such file or directory
LOG: statement: COPY public.a_acc (ids, ids_firma, ids_debit, ids_kredit, date_op, ids_vid_doc, ids_otdel, ids_papka, ids_papka1, ids_papka2, parv_doc, doc, nomer, val, voborot, kurs, oborotlv, zab1, zab2, kod1, kod2, kod3, addkod, pp, instime, modtime, ids_slu) TO stdout;
LOG: server process (pid 1260) was terminated by signal 6
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing shared memory and semaphores
LOG: database system was interrupted at 2003-10-27 14:37:10 EET

How can I find the reason for this problem? We are also trying to use pg in production to replace Oracle, but it looks very unstable with this corruption. Is there any way to protect pg against this error (for example, real-time backup of changes or real-time replication)? We have been using pg for 3 years and we were very happy, but in production it does not look so stable.

Many thanks,
ivan.

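One way to read the two PANIC lines against the "next transaction id: 493496" line is to work out which pg_clog file each transaction ID falls into. The sketch below assumes the default build parameters (8 kB blocks, 2 status bits per transaction, 32 pages per segment, so about one million transactions per file); none of these values are stated in the thread, and the function names are made up for illustration.

```python
# Sketch: map transaction IDs to pg_clog segment file names and back.
# ASSUMPTIONS (not from the thread): 8 kB block size, 4 transaction
# statuses per byte, 32 pages per segment file.

BLOCK_SIZE = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLOCK_SIZE * XACTS_PER_BYTE * PAGES_PER_SEGMENT  # 1,048,576


def clog_segment_for_xid(xid: int) -> str:
    """Return the four-hex-digit segment file name that would hold `xid`."""
    return f"{xid // XACTS_PER_SEGMENT:04X}"


def first_xid_in_segment(name: str) -> int:
    """Return the lowest transaction ID covered by segment file `name`."""
    return int(name, 16) * XACTS_PER_SEGMENT


if __name__ == "__main__":
    # "next transaction id" reported during recovery in the log above
    print("xid 493496 lives in segment", clog_segment_for_xid(493496))
    # Segment files named in the two PANIC messages
    for name in ("0D02", "09FD"):
        print("segment", name, "starts at xid", first_xid_in_segment(name))
```

If those assumptions hold, both files named in the PANIC messages would cover transaction IDs in the billions, far beyond the reported next transaction ID. That would suggest the server read a garbage transaction ID from a damaged data page rather than lost a clog file it once had. This is an inference from the arithmetic, not something the thread confirms.
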
On Mon, Oct 27, 2003 at 06:21:40PM +0100, pginfo wrote:
> Hi all,
> I am using pg 7.3.4 on Linux (Red Hat 7.3) with ReiserFS.

How sure are you that your patch level for your kernel is good? ISTR some issues with reiserfs on some versions of the recent kernels, but there've been so many filesystem problems with Linux over the last couple of years that my recollection is probably not all it might be.

Anyway, I would start wondering about hardware. Try badblocks on your disk. Note that to produce useful results, you may have to do the destructive tests. You'll be wanting to back up your data first.

There have been occasional reports of clog files being destroyed -- search the archives for some quick fixes, but note that you are likely to have some inconsistent data. So far, all the problems have been attributable to hardware, but nobody is ruling out a bug, if it is reproducible. I think Tom Lane is especially interested in looking at cases like this; or he was last time I talked to him about it.

A

-- 
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110

Hi Andrew,

Andrew Sullivan wrote:
> On Mon, Oct 27, 2003 at 06:21:40PM +0100, pginfo wrote:
> > Hi all,
> > I am using pg 7.3.4 on Linux (Red Hat 7.3) with ReiserFS.
>
> How sure are you that your patch level for your kernel is good? ISTR
> some issues with reiserfs on some versions of the recent kernels, but
> there've been so many filesystem problems with Linux over the last
> couple of years that my recollection is probably not all it might be.

I have not applied any patches to my kernel. I used the standard reiserfs that came with my Linux distro. If a better filesystem exists, I am ready to use it. I have also had bad results with ext3. In general, I need a journaling file system.

> Anyway, I would start wondering about hardware. Try badblocks on
> your disk. Note that to produce useful results, you may have to do
> the destructive tests. You'll be wanting to back up your data first.

I cannot :). If I try pg_dump, pg crashes. One of my problems is to restore this data, if possible. I have a cron script that runs pg_dump every 3 hours, but it is not possible to recover the missing data because we have many users. That is why I said I need to be sure that pg is very stable.

> There have been occasional reports of clog files being destroyed --
> search the archives for some quick fixes, but note that you are
> likely to have some inconsistent data.

Any idea for fixing this data is welcome. I will check the data for inconsistencies.

> So far, all the problems have
> been attributable to hardware, but nobody is ruling out a bug, if it
> is reproducible.

We used the system for 30 days without any reboot or stop. The last stop was for a Java upgrade and caused no problems. We have not made any hardware changes (though I will not say it is not hardware).

> I think Tom Lane is especially interested in
> looking at cases like this; or he was last time I talked to him about
> it.

regards,
ivan.

On Tue, Oct 28, 2003 at 03:12:37PM +0100, pginfo wrote:
> I have not applied any patches to my kernel. I used the standard reiserfs
> that came with my Linux distro.

Yes, but have you kept up to date with new kernel releases from Red Hat? They're pretty good about releasing patched kernels if there is a problem.

> > your disk. Note that to produce useful results, you may have to do
> > the destructive tests. You'll be wanting to back up your data first.
>
> I cannot :). If I try pg_dump, pg crashes. One of my problems is to
> restore this data, if possible.

Shut down the database, I'm afraid, and copy the data directory before you start doing things.

> That is why I said I need to be sure that pg is very stable.

So far as I've heard, the cases of clog file corruption have all been related to hardware. You'd best think about replacing your disk or your controller. Have you had any crashes lately? Anything unusual? Are you using ECC RAM? (If not, are you sure you don't have bad RAM? If you got a bad bit written at the wrong time, you'd have a real mess.)

> Any idea for fixing this data is welcome. I will check the data
> for inconsistencies.

As I understand it, you need to zero out the file with dummy data in order to get going again. You really need to plough the archives for what to do, though. I've never had to do this.

A

-- 
Andrew Sullivan

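The "zero out the file with dummy data" step Andrew mentions could look roughly like the sketch below. It is only an illustration under stated assumptions: the 256 kB segment size assumes a default 8 kB-block build, the helper name write_zeroed_segment is made up, and the server is assumed to be shut down with a copy of the data directory already taken. The demo writes into a temporary directory rather than a real pg_clog.

```python
import os
import tempfile

# ASSUMPTION (not from the thread): one clog segment is 256 kB on a
# default 8 kB-block build.
SEGMENT_BYTES = 256 * 1024


def write_zeroed_segment(path: str, size: int = SEGMENT_BYTES) -> None:
    """Create `path` containing `size` zero bytes; never overwrite a file."""
    # Mode "xb" raises FileExistsError instead of clobbering an existing
    # (possibly valid) clog segment.
    with open(path, "xb") as f:
        f.write(b"\x00" * size)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before returning


if __name__ == "__main__":
    # Demonstration in a scratch directory; "0D02" is the file name from
    # the first PANIC message in this thread.
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "0D02")
        write_zeroed_segment(target)
        print(target, os.path.getsize(target), "bytes")
```

A zero-filled segment only lets the server get past the missing-file error so that the data can be dumped. It does not repair whatever caused the bad reference, so the inconsistent data Andrew warns about should still be expected.
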
Andrew Sullivan wrote:
> On Tue, Oct 28, 2003 at 03:12:37PM +0100, pginfo wrote:
> > I have not applied any patches to my kernel. I used the standard reiserfs
> > that came with my Linux distro.
>
> Yes, but have you kept up to date with new kernel releases from Red
> Hat?

Not really. But I have many installations with this version and the same config, and they have all been working well for long periods. The problem is that the one that got corrupted was the biggest one.

> They're pretty good about releasing patched kernels if there is
> a problem.
>
> > > your disk. Note that to produce useful results, you may have to do
> > > the destructive tests. You'll be wanting to back up your data first.
> >
> > I cannot :). If I try pg_dump, pg crashes. One of my problems is to
> > restore this data, if possible.
>
> Shut down the database, I'm afraid, and copy the data directory
> before you start doing things.

I did it.

> > That is why I said I need to be sure that pg is very stable.
>
> So far as I've heard, the cases of clog file corruption have all been
> related to hardware. You'd best think about replacing your disk or
> your controller.

The system has 3 HDDs: one for my application server, one for pg, and one for archives. After the first problem I ran initdb -D on my archive disk, restored the latest available backup, and started inserting data. All was well, but after 10 hours of work pg stopped (that is the error I sent here). Later I also tried to recreate the db on my last disk, and this time I had problems restoring the data. So I think the problem is not with my HDDs (unless all 3 HDDs have a problem at the same time, or the controller does).

> Have you had any crashes lately? Anything unusual?
> Are you using ECC RAM? (If not, are you sure you don't have bad RAM?
> If you got a bad bit written at the wrong time, you'd have a real
> mess.)

My RAM modules are 1 GB ECC each. I am using 2 GB of RAM on this box.

> > Any idea for fixing this data is welcome. I will check the data
> > for inconsistencies.
>
> As I understand it, you need to zero out the file with dummy data in
> order to get going again. You really need to plough the archives for
> what to do, though. I've never had to do this.

regards,
ivan.

> Not really. But I have many installations with this version and the same
> config, and they have all been working well for long periods.
> The problem is that the one that got corrupted was the biggest one.

The good thing about PC hardware is that it is "same same but different". The same hardware (board, CPU, etc.) might behave differently (chip revision, etc.). I suggest you move the database to another machine for testing and try to reproduce the problem there.

regards
-andreas

PS: FYI, ext3 is an ext2 filesystem with journaling enabled (man tune2fs)

-- 
Andreas Schmitz - Phone +49 201 8501 318
Cityweb-Technik-Service-Gesellschaft mbH
Friedrichstr. 12 - Fax +49 201 8501 104
45128 Essen - email a.schmitz@cityweb.de