strange behavior (corruption?) of large production database - Mailing list pgsql-hackers

From Joe Conway
Subject strange behavior (corruption?) of large production database
Date
Msg-id 4390DC54.2050400@joeconway.com
Whole thread Raw
Responses Re: strange behavior (corruption?) of large production database
List pgsql-hackers
We have very strange behavior from an internal production database.

There are multiple symptoms, all pointing to a problem with clusterwide 
tables. For example:

postgres@csdfds1:~> psql -U postgres -p 5433 cyspec
Welcome to psql 7.4.8, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms       \h for help with SQL commands       \? for help on internal slash
commands      \g or terminate with semicolon to execute query       \q to quit
 

cyspec=# select version();                                       version
------------------------------------------------------------------------------------- PostgreSQL 7.4.8 on
i686-pc-linux-gnu,compiled by GCC gcc (GCC) 3.3.3 
 
(SuSE Linux)
(1 row)

postgres@csdfds1:~> psql -l           List of databases      Name      |  Owner   | Encoding
----------------+----------+----------- cyspec         | postgres | SQL_ASCII temp_mike      | postgres | SQL_ASCII
temp_mike_new | postgres | SQL_ASCII temp_mike_orig | postgres | SQL_ASCII template0      | postgres | SQL_ASCII
template1     | postgres | SQL_ASCII
 
(6 rows)


cyspec=# select * from pg_database; datname | datdba | encoding | datistemplate | datallowconn | 
datlastsysoid | datvacuumxid | datfrozenxid | datpath | datconfig | datacl

---------+--------+----------+---------------+--------------+---------------+--------------+--------------+---------+-----------+--------
(0 rows)

cyspec=# \l    List of databases Name | Owner | Encoding
------+-------+----------
(0 rows)


No databases found. Additionally:


cyspec=# select usename, usesysid from pg_shadow; usename  | usesysid
----------+---------- postgres |        1 colxl    |      102 colro    |      101 l400509  |      105
(4 rows)

cyspec=# \c - colprod
You are now connected as new user "colprod".
cyspec=> \c - colxl
You are now connected as new user "colxl".
cyspec=> \c - colprod
You are now connected as new user "colprod".
cyspec=> \c - zxcvvb
FATAL:  user "zxcvvb" does not exist
Previous connection kept

The "colprod" user has disappeared from pg_shadow (there was one 
previously, and it was never intentionally dropped), but I can still 
connect with that user. The current problem was actually initially found 
because pg_dump complained that the owner of the colprod schema didn't 
exist.

One more thing:
cyspec=# show wal_sync_method; wal_sync_method
----------------- fdatasync
(1 row)

That works, but SHOW ALL and "select * from pg_settings;" return lines 
and lines of nothing in psql. (I mean literally blank lines, not even "(0 rows)")
After issuing \o /tmp/filename the output is all there, and looks normal.

The oddness was first noticed about 3 days after a maintenance shutdown. 
As far as I have been told, during the maintenance window, there may 
have been OS level package upgrades, and there was a firmware upgrade 
done on the storage subsystem (NetApp).

Any advice at what to look at/do would be appreciated. This database is 
somewhere around 1.1 TB in size, so dump and reload is not something 
we're anxious to do.

Joe




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Reducing relation locking overhead
Next
From: Tom Lane
Date:
Subject: Re: Spam 508