Thread: Database corruption in 7.0.3

Database corruption in 7.0.3

From
Tim Allen
Date:
We have an application that we were running quite happily using pg6.5.3
in various customer sites. Now we are about to roll out a new version of
our application, and we are going to use pg7.0.3. However, in testing
we've come across a couple of isolated incidents of database
corruption. They are sufficiently rare that I can't reproduce the problem,
nor can I put my finger on just what application behaviour causes the
problems.

The symptoms most often involve some sort of index corruption, which is
reported by vacuum and it seems that vacuum can fix it. On occasion vacuum
reports "invalid OID" or similar (sorry, don't have exact wording of
message). On one occasion the database has been corrupted to the point of
unusability (ie vacuum admitted that it couldn't fix the problem), and a
dump/restore was required (thankfully that at least worked). The index
corruption also occasionally manifests itself in the form of spurious
uniqueness constraint violation errors.

The previous version of our app using 6.5.3 has never shown the slightest
symptom of database misbehaviour, to the best of my knowledge, despite
fairly extensive use. So our expectations are fairly high :-).

One thing that is different about the new version of our app is that we
now use multiple connections to the database (previously we only had
one). We can in practice have transactions in progress on several
connections at once, and it is possible for some transactions to be rolled
back under application control (ie explicit ROLLBACK; statement).

I realise I haven't really provided an awful lot of information that would
help identify the problem, so I shall attempt to be understanding if
no-one can offer any useful suggestions. But I hope someone can :-). Has
anyone seen this sort of problem before? Are there any known
database-corrupting bugs in 7.0.3? I don't recall anyone mentioning any in
the mailing lists. Is using multiple connections likely to stimulate any
known areas of risk?

BTW we are using plain vanilla SQL, no triggers, no new types defined, no
functions, no referential integrity checks, nothing more ambitious than a
multi-column primary key.

The platform is x86 Red Hat Linux 6.2. Curiously enough, on one of our
testing boxes and on my development box we have never seen this, but we
have seen it several times on our other test box and at least one customer
site, so there is some possibility it's related to dodgy hardware. The
customer box with the problem is a multi-processor box, all the other
boxes we've tested on are single-processor.

TIA for any help,

Tim

-- 
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/ http://www4.tpg.com.au/users/rita_tim/



Re: Database corruption in 7.0.3

From
Denis Perchine
Date:
Can confirm this. Get this just yesterday time ago...

Messages:

NOTICE:  Rel acm: TID 1697/217: OID IS INVALID. TUPGONE 1.

And lots of such lines...
And

pqReadData() -- backend closed the channel unexpectedly.       This probably means the backend terminated abnormally
  before or while processing the request.
 


In the end :-((( I lost a library of our institute... :-((( But I have a 
backup!!! :-)))) This table even have NO indices!!!

Program received signal SIGSEGV, Segmentation fault.
0x813837f in PageRepairFragmentation (page=0x82840b0 "") at bufpage.c:311
311                             alignedSize = MAXALIGN((*lp).lp_len);
(gdb) bt
#0  0x813837f in PageRepairFragmentation (page=0x82840b0 "") at bufpage.c:311
#1  0x80a9b07 in vc_scanheap (vacrelstats=0x82675b0, onerel=0x8273428, 
vacuum_pages=0xbfffe928, fraged_pages=0xbfffe918) at vacuum.c:1022
#2  0x80a8e8b in vc_vacone (relid=27296, analyze=0 '\000', va_cols=0x0) at 
vacuum.c:599
#3  0x80a8217 in vc_vacuum (VacRelP=0xbfffe9b4, analyze=0 '\000', 
va_cols=0x0) at vacuum.c:299
#4  0x80a818b in vacuum (vacrel=0x8267400 "", verbose=1 '\001', analyze=0 
'\000', va_spec=0x0) at vacuum.c:223
#5  0x813fba5 in ProcessUtility (parsetree=0x8267418, dest=Remote) at 
utility.c:694
#6  0x813c16e in pg_exec_query_dest (query_string=0x820aaa0 "vacuum verbose 
acm;", dest=Remote, aclOverride=0 '\000') at postgres.c:617
#7  0x813c08e in pg_exec_query (query_string=0x820aaa0 "vacuum verbose acm;") 
at postgres.c:562
#8  0x813d4c3 in PostgresMain (argc=9, argv=0xbffff068, real_argc=9, 
real_argv=0xbffffa3c) at postgres.c:1588
#9  0x811ace5 in DoBackend (port=0x8223068) at postmaster.c:2009
#10 0x811a639 in BackendStartup (port=0x8223068) at postmaster.c:1776
#11 0x811932f in ServerLoop () at postmaster.c:1037
#12 0x8118b0e in PostmasterMain (argc=9, argv=0xbffffa3c) at postmaster.c:725
#13 0x80d5e5e in main (argc=9, argv=0xbffffa3c) at main.c:93
#14 0x40111fee in __libc_start_main () from /lib/libc.so.6

This is plain 7.0.3.

On Thursday 15 March 2001 14:52, Tim Allen wrote:
> We have an application that we were running quite happily using pg6.5.3
> in various customer sites. Now we are about to roll out a new version of
> our application, and we are going to use pg7.0.3. However, in testing
> we've come across a couple of isolated incidents of database
> corruption. They are sufficiently rare that I can't reproduce the problem,
> nor can I put my finger on just what application behaviour causes the
> problems.
>
> The symptoms most often involve some sort of index corruption, which is
> reported by vacuum and it seems that vacuum can fix it. On occasion vacuum
> reports "invalid OID" or similar (sorry, don't have exact wording of
> message). On one occasion the database has been corrupted to the point of
> unusability (ie vacuum admitted that it couldn't fix the problem), and a
> dump/restore was required (thankfully that at least worked). The index
> corruption also occasionally manifests itself in the form of spurious
> uniqueness constraint violation errors.
>
> The previous version of our app using 6.5.3 has never shown the slightest
> symptom of database misbehaviour, to the best of my knowledge, despite
> fairly extensive use. So our expectations are fairly high :-).
>
> One thing that is different about the new version of our app is that we
> now use multiple connections to the database (previously we only had
> one). We can in practice have transactions in progress on several
> connections at once, and it is possible for some transactions to be rolled
> back under application control (ie explicit ROLLBACK; statement).
>
> I realise I haven't really provided an awful lot of information that would
> help identify the problem, so I shall attempt to be understanding if
> no-one can offer any useful suggestions. But I hope someone can :-). Has
> anyone seen this sort of problem before? Are there any known
> database-corrupting bugs in 7.0.3? I don't recall anyone mentioning any in
> the mailing lists. Is using multiple connections likely to stimulate any
> known areas of risk?
>
> BTW we are using plain vanilla SQL, no triggers, no new types defined, no
> functions, no referential integrity checks, nothing more ambitious than a
> multi-column primary key.
>
> The platform is x86 Red Hat Linux 6.2. Curiously enough, on one of our
> testing boxes and on my development box we have never seen this, but we
> have seen it several times on our other test box and at least one customer
> site, so there is some possibility it's related to dodgy hardware. The
> customer box with the problem is a multi-processor box, all the other
> boxes we've tested on are single-processor.
>
> TIA for any help,
>
> Tim

-- 
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: dyp@perchine.com
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


Re: Database corruption in 7.0.3

From
Tom Lane
Date:
Tim Allen <tim@proximity.com.au> writes:
> Are there any known database-corrupting bugs in 7.0.3?

None that aren't also in earlier releases, AFAIR, so your report is
fairly troubling.  However there's not enough here to venture a guess
about the source of the problem.

Do you see any backend crashes or other misbehavior before the VACUUM
error pops up, or is that the only symptom?

It would be a good idea to rebuild the system with assert checks on
(configure --enable-cassert), in hopes that some Assert a little closer
to the source of the problem will fire.  Also, if you can spare some
disk space for logging, running the postmaster with -d2 to log all
queries might provide useful historical context when the problem
reappears.

I would like to be able to study the corrupted table, as well.  Can you
see your way to either giving me access to your machine, or (if the
database isn't too large) sending me a tar dump of the whole $PGDATA
directory next time it happens?

Please contact me off-list so we can figure out how best to pursue this
problem.
        regards, tom lane