Re: Deep Trouble - Mailing list pgsql-general

From Karl DeBisschop
Subject Re: Deep Trouble
Date
Msg-id 39EC49CA.66D8C6DA@debisschop.net
Whole thread Raw
In response to A better stop  ("Warren Flemmer" <warren@netlab.co.za>)
List pgsql-general
Abe Asghar wrote:
>
>     Hi guys,
>
> Havin some trouble.
>
> One of my databases appeared to be empty suddenly after having a large
> amount of data in it.  I contacted our server company and they gave me the
> postgres dir.
>
> I have put back the folder of the newsdatabase from the base dir into the
> base dir of Postgres and recreated the database.
>
> That's fine.
>
> However now the database is empty.  When I do a cat on the file of the same
> name as one of the tables - it has loads of data in it.  However when I go
> in to Postgres and try to list the table it comes back with ) rows.
>
> Any ideas I am desperate.
>
> I am using linux redhat and Postgres
>
> Thanks
> Abe

We saw this on PG 7.0.2 with RH 6.2smp (specifically kernel
2.2.16-3smp). What happened in our case was that the transaction

If it's the same bug, the data is there, here's a couple of diagnostics
from Tom Lane:




>>>>> Message 1 >>>>>>>
Well, I still don't know what caused the "not a btree" complaint, but
I see what's the matter with your missing databases: the current
transaction numbers are too small.  In /var/lib/pgsql/data.rpmnew,
pg_variable has a next transaction number of 3586:

[tlane@sterno data.rpmnew]$ pwd
/var/lib/pgsql/data.rpmnew
[tlane@sterno data.rpmnew]$ od -l pg_variable
0000000           0        3586           0   321084263
0000020           0           0           0           0
*
0020000

and in the older DB it's 5634:

[tlane@sterno data]$ pwd
/disk/1/pgsql/data
[tlane@sterno data]$ od -d pg_variable
   [tlane@sterno data]$ od -l pg_variable
0000000           0        5634           0   320895230
0000020           0           0           0           0
*
0020000

Now your DBs (other than template1) were all created by much higher
transaction numbers.  For example in the older installation "biobase"'s
pg_database tuple has t_xmin = 9752, meaning it was created by
transaction number 9752.  That's far in the future according to the
system's current view of the world, so the pg_database tuple is
considered not to be committed yet, which is why you can't see it.
You can connect to that DB anyway because the connection-time check
for tuple validity is a little bit lax, but once you're connected
most of the system-table tuples will look like they don't exist yet,
ditto tuples you try to create, thus things act pretty broken.

So now we know the immediate cause of the weird behavior, and the
next question is how did pg_variable get into this state?  Is it
conceivable that you somehow restored an old copy over the current
version of the file, or some such thing?

BTW, in your running postmaster (port 5432), active transaction
numbers seem to be around 22500, which itself seems suspiciously
low, unless this is a very low-traffic server or has recently been
initdb'd.  Test method:

[tlane@sterno data]$ PGPORT=5432 psql biobase
Welcome to psql, the PostgreSQL interactive terminal.

biobase=# create table foofoofoo (f1 int);
CREATE
biobase=# select oid,xmin,* from pg_class where relname = 'foofoofoo';
    oid    | xmin  |  relname  | reltype | relowner | relam | relpages |
reltuples | rellongrelid
| relhasindex | relisshared | relkind | relnatts | relchecks |
reltriggers | relukeys | relfkeys
| relrefs | relhaspkey | relhasrules | relacl

-----------+-------+-----------+---------+----------+-------+----------+-----------+--------------+-------------+-------------+---------+----------+-----------+-------------+----------+----------+---------+------------+-------------+--------
 327912528 | 22529 | foofoofoo |       0 |    10002 |     0 |       10
|      1000 |            0
| f           | f           | r       |        1 |         0 |
0 |        0 |        0
|       0 | f          | f           |
(1 row)

biobase=# drop table foofoofoo;
DROP

>>>>>>>>>>>>> Message 2

> If it does happen again, is there any way I can reset the reset the
> tarnasction number manually? I'm doing a pg_dumpall every six hours, but
> it's sort of time consuming to have to do a complete install/dump/restore
> when the DB goes haywire llike this.

You can stop the postmaster and hack on pg_variable manually.  The only
trick is to know what to set the value to.  Probably (byte length of
pg_log / 4) - 1 would work.

<<<<<<<<< end quoted messages

We never solved this one -- we reverted Postgres to 7.0-2 (the -2 is the
RPM), and have not had a problem with that installation since.  We
expect to have a free dual processor machine sometime around the end of
the year, but we really cannot risk further experimentation on this
issue with any of or current dual-processer systems which are all
mission critical, and all have utilizations that are already as high as
we'd like them to be. We hope to test more aggressively once that spare
machine is free, but right now we don't have the hardware or manpower
necessary.

We have 7.0.2 running in several single processor machines with no
problem. But the SMP does not seem to play well with 7.0.2, or at least
with our build of 7.0.2, which might have different optimizations that
Lamar's RPM, I suppose.

--
Karl DeBisschop                      kdebisschop@alert.infoplease.com
Learning Network/Information Please  http://www.infoplease.com
Netsaint Plugin Developer            kdebisschop@users.sourceforge.net

pgsql-general by date:

Previous
From: Victor Ivanov
Date:
Subject: C function and NULL result
Next
From: luc
Date:
Subject: off topic