Thread: A better stop

A better stop

From
"Warren Flemmer"
Date:
Greetings

I was (and still am) having the intermittent postgresql startup problem. The
error messages recorded in the logs seem to imply some sort of shared memory
problem. The system is as stripped down as possible (only the essentials and
postgresql), which gives me the impression that a lack of buffers etc. is
unlikely. So I simplified my tests and came up with the following:

Postgresql will always start if I do not query the DB (using ODBC from an NT
server). If I shut down the server while it is in the middle of a large
query, then it will not start the next time round. Stopping postgresql is
reported as failed. The shutdown command used is 'shutdown -r now'. When
shutting down, '/etc/rc.d/init.d/postgresql stop' is run; it is the killproc
in this script that fails.
I am using the script out of the box, so to speak; the only alteration is
the logfile. It has all the cleanups that the mailing list archives mention.

I find it difficult to believe I am the only one experiencing this. It seems
to me, and I am far from home ground here, that when shutting down, the first
step should be to stop accepting new queries and complete any existing work,
and only then kill the process (perhaps this is what killproc does).

Perhaps the shutdown script should change so that if the killproc fails, it
waits a few seconds and retries. If it fails again, it should keep waiting
and retrying until the stop succeeds or a timeout is reached.
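
Something like the following is the sort of thing I mean for the script's
stop action. It is only an untested sketch; the pg_ctl path, the data
directory, the logfile, the timeout, and the use of pg_ctl rather than
killproc are all guesses on my part:

  stop() {
      echo -n "Stopping postgresql service: "
      # Ask the postmaster to shut down; "smart" mode lets existing
      # backends (and so any running queries) finish first.
      su -l postgres -c "/usr/bin/pg_ctl -D /var/lib/pgsql/data -m smart stop" \
          >> /var/log/pgsql 2>&1
      # If the postmaster is still running, wait and re-check for up to
      # a minute before reporting failure.
      tries=0
      while pidof -s postmaster > /dev/null 2>&1
      do
          tries=`expr $tries + 1`
          if [ $tries -gt 12 ]; then
              echo "failed"
              return 1
          fi
          sleep 5
      done
      echo "ok"
      return 0
  }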

What are your thoughts on this? Am I going completely in the wrong
direction? Would changing the script be a good idea?

Regards
Warren

Config
Redhat 6.2 with PostgreSQL 7




Deep Trouble

From
"Abe Asghar"
Date:
    Hi guys,

Having some trouble.

One of my databases appeared to be empty suddenly after having a large
amount of data in it.  I contacted our server company and they gave me the
postgres dir.

I have put the newsdatabase folder from that copy back into the base dir of
Postgres and recreated the database.

That's fine.

However, now the database is empty.  When I do a cat on the file of the same
name as one of the tables, it has loads of data in it.  However, when I go
into Postgres and try to list the table, it comes back with 0 rows.

Any ideas?  I am desperate.

I am using Red Hat Linux and Postgres.

Thanks
Abe



Re: Deep Trouble

From
Karl DeBisschop
Date:
Abe Asghar wrote:
>
>     Hi guys,
>
> Having some trouble.
>
> One of my databases appeared to be empty suddenly after having a large
> amount of data in it.  I contacted our server company and they gave me the
> postgres dir.
>
> I have put the newsdatabase folder from that copy back into the base dir of
> Postgres and recreated the database.
>
> That's fine.
>
> However, now the database is empty.  When I do a cat on the file of the same
> name as one of the tables, it has loads of data in it.  However, when I go
> into Postgres and try to list the table, it comes back with 0 rows.
>
> Any ideas?  I am desperate.
>
> I am using Red Hat Linux and Postgres.
>
> Thanks
> Abe

We saw this on PG 7.0.2 with RH 6.2smp (specifically kernel
2.2.16-3smp). What happened in our case was that the transaction
counter kept in pg_variable ended up lower than the transaction numbers
that had created our tuples, so the data was still on disk but looked as
though it had never been committed.

If it's the same bug, the data is still there.  Here are a couple of
diagnostic messages from Tom Lane:




>>>>> Message 1 >>>>>>>
Well, I still don't know what caused the "not a btree" complaint, but
I see what's the matter with your missing databases: the current
transaction numbers are too small.  In /var/lib/pgsql/data.rpmnew,
pg_variable has a next transaction number of 3586:

[tlane@sterno data.rpmnew]$ pwd
/var/lib/pgsql/data.rpmnew
[tlane@sterno data.rpmnew]$ od -l pg_variable
0000000           0        3586           0   321084263
0000020           0           0           0           0
*
0020000

and in the older DB it's 5634:

[tlane@sterno data]$ pwd
/disk/1/pgsql/data
[tlane@sterno data]$ od -l pg_variable
0000000           0        5634           0   320895230
0000020           0           0           0           0
*
0020000

Now your DBs (other than template1) were all created by much higher
transaction numbers.  For example in the older installation "biobase"'s
pg_database tuple has t_xmin = 9752, meaning it was created by
transaction number 9752.  That's far in the future according to the
system's current view of the world, so the pg_database tuple is
considered not to be committed yet, which is why you can't see it.
You can connect to that DB anyway because the connection-time check
for tuple validity is a little bit lax, but once you're connected
most of the system-table tuples will look like they don't exist yet,
ditto tuples you try to create, thus things act pretty broken.

So now we know the immediate cause of the weird behavior, and the
next question is how did pg_variable get into this state?  Is it
conceivable that you somehow restored an old copy over the current
version of the file, or some such thing?

BTW, in your running postmaster (port 5432), active transaction
numbers seem to be around 22500, which itself seems suspiciously
low, unless this is a very low-traffic server or has recently been
initdb'd.  Test method:

[tlane@sterno data]$ PGPORT=5432 psql biobase
Welcome to psql, the PostgreSQL interactive terminal.

biobase=# create table foofoofoo (f1 int);
CREATE
biobase=# select oid,xmin,* from pg_class where relname = 'foofoofoo';
    oid    | xmin  |  relname  | reltype | relowner | relam | relpages | reltuples | rellongrelid
-----------+-------+-----------+---------+----------+-------+----------+-----------+--------------
 327912528 | 22529 | foofoofoo |       0 |    10002 |     0 |       10 |      1000 |            0

 relhasindex | relisshared | relkind | relnatts | relchecks | reltriggers | relukeys | relfkeys
-------------+-------------+---------+----------+-----------+-------------+----------+----------
 f           | f           | r       |        1 |         0 |           0 |        0 |        0

 relrefs | relhaspkey | relhasrules | relacl
---------+------------+-------------+--------
       0 | f          | f           |
(1 row)

biobase=# drop table foofoofoo;
DROP

>>>>>>>>>>>>> Message 2

> If it does happen again, is there any way I can reset the transaction
> number manually? I'm doing a pg_dumpall every six hours, but it's sort
> of time consuming to have to do a complete install/dump/restore when
> the DB goes haywire like this.

You can stop the postmaster and hack on pg_variable manually.  The only
trick is to know what to set the value to.  Probably (byte length of
pg_log / 4) - 1 would work.

<<<<<<<<< end quoted messages
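
If you want to see whether you've hit the same thing, comparing the counter
Tom reads out of pg_variable with the xmin on your database's pg_database row
should show it, and his formula for a replacement value is easy to compute.
A rough, untested sketch -- the data directory is the Red Hat RPM default and
'newsdatabase' is just Abe's example name, so adjust both, and run it as the
postgres user:

  # Where the RPM puts the cluster; yours may differ.
  PGDATA=/var/lib/pgsql/data

  # Current transaction counter (second number on the first line),
  # as in Tom's diagnostics above.
  od -l $PGDATA/pg_variable | head -1

  # xmin of the row describing your database -- if it is larger than the
  # counter above, the database will look uncommitted, i.e. empty.
  psql template1 -c "select xmin, datname from pg_database where datname = 'newsdatabase'"

  # Tom's suggested replacement value: (byte length of pg_log / 4) - 1.
  expr `wc -c < $PGDATA/pg_log` / 4 - 1

Stop the postmaster before poking at pg_variable itself, of course, and treat
the above strictly as a sketch; it is not something we have run in anger.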

We never solved this one -- we reverted Postgres to 7.0-2 (the -2 is the
RPM release), and have not had a problem with that installation since.  We
expect to have a free dual-processor machine sometime around the end of
the year, but we really cannot risk further experimentation on this
issue with any of our current dual-processor systems, which are all
mission critical and all have utilizations that are already as high as
we'd like them to be. We hope to test more aggressively once that spare
machine is free, but right now we don't have the hardware or manpower
necessary.

We have 7.0.2 running on several single-processor machines with no
problem. But SMP does not seem to play well with 7.0.2, or at least
with our build of 7.0.2, which might have different optimizations than
Lamar's RPM, I suppose.

--
Karl DeBisschop                      kdebisschop@alert.infoplease.com
Learning Network/Information Please  http://www.infoplease.com
Netsaint Plugin Developer            kdebisschop@users.sourceforge.net