Losing data from Postgres - Mailing list pgsql-admin

From Paul Breen
Subject Losing data from Postgres
Date
Msg-id Pine.LNX.3.96.1001115152350.6198F-100000@cpark37.computerpark.co.uk
Whole thread Raw
Responses Re: Losing data from Postgres
Re: Losing data from Postgres
List pgsql-admin
Hello everyone,

Can anyone help us?

We are using Postgres in a hotspare configuration, that is, we have 2
separate boxes both running identical versions of Postgres and everytime
we insert|update|delete from the database we write to both boxes (at the
application level).  All communications to the databases are in
transaction blocks and if we cannot commit to both databases then we
rollback.

Recently we have discovered that some records are missing from the main
database table on the master box but are present in the table on the
spare.  We know its not a transaction problem (rolling back on master but
not spare) because reports printed at the time of the records being
written show the data in the master.  It seems that some time afterwards
the records just disappear from the master!  Both boxes are raided (Raid
5), don't know if this is significant.

Originally we were vacuuming twice a day but because some of the reports
we produce regularly were taking too long as the database grew, we added
multiple indexes onto the key tables and began vacuuming every hour.  It's
only after doing this that we noticed the data loss - don't know if this
is coincidental or not.  Yesterday we went back to vacuuming only twice a
day.

Looking in Postgres' server log we found some worrying error messages.
Again, don't know if these are significant:

1000822.00:39:44.676 [5182] NOTICE:  LockRelease: locktable lookup failed,
no lock

515 instances on the master, 7 on the spare.

1000929.15:33:40.454 [20007] NOTICE:  LockReplace: xid table corrupted

80 instances on the master, 0 on the spare - is this the problem?

1001027.11:55:04.071 [4847] NOTICE:  Index dttmprod_ndx: NUMBER OF INDEX'
TUPLES (33586) IS NOT THE SAME AS HEAP' (33578)

1065 instances on the master, 20 on the spare.  The reason there is so
many of these messages is that we didn't spot this error until after we
had done a lot of vacuums.  Since rebuilding the indexes (dropping &
re-creating) we haven't seen this one again.

We are using Postgres version 6.5.3.  I know its old but this is the
customer standard and we are loathe to upgrade unless its a problem
inherent with this version.  From a cost-of-ownership viewpoint, its
easier to support a single version than multiple versions.  The OS is SuSE
Linux 6.3 (kernel 2.2.13), the control system programs talk to Postgres
via libpq, the web interface is PHP3 and the Tcl/Tk interface is pgtksh.
The Raid controller is a DPT SmartCache 4 (28 MB RAM).   Both the master
and the spare boxes are Intel Pentium III 550Mhz with 256MB RAM.

If we've left anything out, write and we'll supply more information.

Any help, especially around Postgres' server log messages would be greatly
appreciated.


Paul M. Breen, Software Engineer - Computer Park Ltd.

Tel:   (01536) 417155
Email: pbreen@computerpark.co.uk


pgsql-admin by date:

Previous
From: "Elias "
Date:
Subject: Read please
Next
From: Jean-Marc Pigeon
Date:
Subject: Re: Losing data from Postgres