Re: production server down - Mailing list pgsql-hackers

From: Joe Conway
Subject: Re: production server down
Msg-id: 41C4AF23.9070406@joeconway.com
In response to: Re: production server down (Michael Fuhr <mike@fuhr.org>)
List: pgsql-hackers

Michael Fuhr wrote:
> On Wed, Dec 15, 2004 at 11:41:02AM -0800, Joe Conway wrote:
> 
>>Just wanted to close the loop for the sake of the list archives. With 
>>Tom's xlog dump tool I was able (with a bunch of his help off-list) to 
>>identify the needed parameters for pg_resetxlog. Running pg_resetxlog 
>>got us back a running database. We're now involved in checking the data.
> 
> Any chance you could write up a summary of the thread: what caused
> the problem, how you diagnosed it, how you fixed it, and how to
> avoid it?  Might make a useful "lessons learned" document.

Sorry for the delay -- been a busy week. See a summary below. Hope 
someone finds this useful. Warning -- this is a bit long...


-----------
Background:
-----------
The server in question is an IBM x445, 8-way Xeon, 8 GB RAM. We're 
running SuSE 9 with the postgresql-7.4.5-36.4 RPM. The database is just 
over 400GB in size at this point, and resides on a dedicated NFS mounted 
Network Appliance volume (~6 TB). The server has 2 network interfaces, 
both gigabit ethernet. One interface (eth0) is dedicated to the NFS 
mounted storage. It is on a private storage subsystem vlan, running with 
"jumbo frames" (9K):

# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:09:6B:E6:33:B7
          [...]
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1

The data volume is mounted thus:

# cat /etc/fstab
csddata7-vlan35:/vol/vol0/replica /replica             nfs 
proto=tcp,suid,rw,vers=3,proto=tcp,timeo=600,retrans=2,hard,fg,rsize=8192,wsize=8192 
0 0

We have been continuously and aggressively bulk loading parametric data 
collected from our company's equipment in the field (i.e. in use at our 
customers) for the past several weeks. This is part of a push to get 
"caught up" with available data, after which we expect bulk loading to 
take a few hours each evening. The server had been up since November 2, 
2004.

On December 13 the server experienced a complete hang, requiring our 
unix admin to go into the datacenter and physically cycle power. We 
don't know the exact cause of that hang, but we have recently 
experienced a similar hang on two similar servers (both IBM x445, both 
running SuSE 8.x, one running Oracle 9i, the other an application server).

<aside> In both of those cases we were advised to increase the size of
our swap partition -- for some reason either SuSE's installation
defaults, or the admin doing the installation (not sure which), had
decided that 1GB of swap was sufficient for machines with 8GB of RAM.
Novell's support advised increasing swap to >= physical RAM. We had made
that change on the two other servers, but not on the Postgres server,
because it was busy loading data at the time. We've now made that
change, applied all IBM firmware updates, and are waiting to see if the
problem repeats. </aside>

Upon server restart, there was instability noted with the network 
interfaces, and so they were reset (not exactly sure how the unix admin 
determined that, but that is what I was told). Unfortunately, the init.d 
script was in the process of starting postgresql while the eth0 
interface (and thus the NFS mount and PGDATA) was "unstable".


--------------------------------
The PostgreSQL Specific Problem:
--------------------------------
We decided to wait until the 14th before resuming data loading, because 
the developer who wrote the loading scripts was due back in the office 
that day after taking vacation. On the 14th, he noted that Postgres was 
not running, and further, would not start. Here is the snippet from the 
logs:

2004-12-13 15:05:52 LOG:  recycled transaction log file "000001650000004C"
2004-12-13 15:26:01 LOG:  recycled transaction log file "000001650000004D"
2004-12-13 16:39:55 LOG:  database system was shut down at 2004-11-02 
17:05:33 PST
2004-12-13 16:39:55 LOG:  checkpoint record is at 0/9B0B8C
2004-12-13 16:39:55 LOG:  redo record is at 0/9B0B8C; undo record is at 
0/0; shutdown TRUE
2004-12-13 16:39:55 LOG:  next transaction ID: 536; next OID: 17142
2004-12-13 16:39:55 LOG:  database system is ready
2004-12-14 15:36:20 FATAL:  IDENT authentication failed for user "colprod"
2004-12-14 15:36:58 FATAL:  IDENT authentication failed for user "colprod"
2004-12-14 15:39:26 LOG:  received smart shutdown request
2004-12-14 15:39:26 LOG:  shutting down
2004-12-14 15:39:28 PANIC:  could not open file 
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No 
such file or directory
2004-12-14 15:39:28 LOG:  shutdown process (PID 23202) was terminated by 
signal 6
2004-12-14 15:39:39 LOG:  database system shutdown was interrupted at 
2004-12-14 15:39:26 PST
2004-12-14 15:39:39 LOG:  could not open file 
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No 
such file or directory
2004-12-14 15:39:39 LOG:  invalid primary checkpoint record
2004-12-14 15:39:39 LOG:  could not open file 
"/replica/pgdata/pg_xlog/0000000000000000" (log file 0, segment 0): No 
such file or directory
2004-12-14 15:39:39 LOG:  invalid secondary checkpoint record
2004-12-14 15:39:39 PANIC:  could not locate a valid checkpoint record
2004-12-14 15:39:39 LOG:  startup process (PID 23298) was terminated by 
signal 6
2004-12-14 15:39:39 LOG:  aborting startup due to startup process failure

Apparently, either because of the server hang or because of the flaky
eth0 interface on reboot, pg_control had become "corrupt". However, it
was not corrupt in the sense of containing impossibly invalid data. In
fact, as Alvaro pointed out, its values all look close to those one
would find in a freshly initdb'd pg_control file, except for the last
modified date:

# pg_controldata /replica/pgdata
pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               shutting down
pg_control last modified:             Tue Dec 14 15:39:26 2004
Current log file ID:                  0
Next log file segment:                1
Latest checkpoint location:           0/9B0B8C
Prior checkpoint location:            0/9AA1B4
Latest checkpoint's REDO location:    0/9B0B8C
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        12
Latest checkpoint's NextXID:          536
Latest checkpoint's NextOID:          17142
Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

We still do not know exactly how the pg_control file came to look like
this. I suspect the NFS mount instability during server start somehow
caused it, but I can't see anything obvious in xlog.c that would
support that theory.


-----------
The Repair:
-----------
Based on Tom's advice, we set out to repair pg_control using 
pg_resetxlog (but first thing, of course, we saved a copy of $PGDATA in 
its present, unworking, state).
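
A hypothetical sketch of that first step (the function name and the
"broken" suffix are mine; the post only says we saved a copy of
$PGDATA, with the postmaster stopped):

```shell
# snapshot_pgdata: copy a (stopped) data directory aside before any repair.
snapshot_pgdata() {
    src=$1
    dst=$src.broken-$(date +%Y%m%d)
    # -a preserves ownership, permissions, and timestamps
    cp -a "$src" "$dst" && echo "$dst"
}

# e.g.:  snapshot_pgdata /replica/pgdata
```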

The manpage for pg_resetxlog gives a general idea of how it is used, and
a way to estimate the next transaction ID and WAL segment. It does not
give a way to recover the next OID, but points out that the next OID
value isn't really all that critical. I was interested in whether we
could do better than estimates, based on this comment in pg_resetxlog.c:
  /*
   * XXX eventually, should try to grovel through old XLOG to develop
   * more accurate values for startupid, nextXID, and nextOID.
   */
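
For what it's worth, the manpage's estimation recipe can be scripted.
A rough sketch (the pg_clog segment name below is made up for
illustration; the WAL file name is the last one from our pg_xlog):

```shell
# Estimate pg_resetxlog inputs from file names alone, per the 7.4 manpage.

# NextXID: take the numerically largest file name in $PGDATA/pg_clog,
# add one, and multiply by 1048576 (transactions covered per clog segment).
last_clog=0003                        # e.g. from: ls $PGDATA/pg_clog
next_xid=$(( (0x$last_clog + 1) * 1048576 ))

# WAL start: the largest name in $PGDATA/pg_xlog is 8 hex digits of log id
# followed by 8 of segment; pass something strictly greater to -l.
last_wal=000001650000004E             # e.g. from: ls $PGDATA/pg_xlog
log_id=$(( 0x${last_wal%????????} ))  # first 8 hex digits -> 357
seg=$((    0x${last_wal#????????} ))  # last 8 hex digits  -> 78

printf 'pg_resetxlog -x %d -l 0x%X,0x%X $PGDATA\n' \
    "$next_xid" "$log_id" "$((seg + 1))"
```

These are only safe over-estimates; the xlog dump described next gave us
the exact values instead.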

Based on the fact that we had an intact-looking set of files in
$PGDATA/pg_xlog, and that we were running 7.4, Tom offered to send a
crude xlog dump tool that he had written for his own use (aside: this
worked really well in my case; should it be part of the distribution?).

After trading a few off-list emails with Tom, I was able to find the
location of the last good checkpoint in the xlog files:

357/4EBFD094: prv 357/4EBFD04C; xprv 357/4EBFD04C; xid 3877222; RM 10 
info 00 len 34
insert: tblSpace 17144 rel 22084 block 557795 off 89
357/4EBFD0D8: prv 357/4EBFD094; xprv 357/4EBFD094; xid 3877222; RM 11 
info 00 len 38
357/4EBFD120: prv 357/4EBFD0D8; xprv 357/4EBFD0D8; xid 3877222; RM 10 
info 00 len 34
insert: tblSpace 17144 rel 22084 block 557795 off 90
357/4EBFD164: prv 357/4EBFD120; xprv 357/4EBFD120; xid 3877222; RM 11 
info 00 len 38
357/4EBFD1AC: prv 357/4EBFD164; xprv 0/00000000; xid 0; RM  0 info 10 len 32
checkpoint: redo 357/4EBFD1AC; undo 0/00000000; sui 58; nextxid 3904963;
nextoid 785429799; online at 2004-12-13 15:41:03 PST
ReadRecord: record with zero len at 357/4EBFD1EC
Unexpected page info flags 0001 at offset BFE000
Skipping unexpected continuation record at offset BFE000
357/4EBFE024: prv 357/24BFDFC8(?); xprv 357/24BFDFC8; xid 3877261; RM 10 
info 00 len 45
insert: tblSpace 17144 rel 22114 block 2599 off 97
357/4EBFE074: prv 357/24BFE024(?); xprv 357/24BFE024; xid 3877261; RM 11 
info 00 len 38
357/4EBFE0BC: prv 357/24BFE074(?); xprv 357/24BFE074; xid 3877261; RM 10 
info 00 len 44
insert: tblSpace 17144 rel 22114 block 2599 off 98

On seeing the above, Tom said this: "Yeah, this is exactly what we want:
the "record with zero len" is the actual end of WAL, and the complaints
after that betray the fact that the next page isn't in sequence.  So it
looks like you want to use those nextxid and nextoid values, plus make
sure that the starting WAL file name is > 000001650000004E.  I believe
"-l 0x165,0x4F" should handle that."

Based on that advice, here's what we ran:
pg_resetxlog -f -o 786000000 -x 0x3b95c3 -l 0x165,0x4f /replica/pgdata

We bumped the next OID up a bit just "for good measure" and knowing the 
specific value is less critical than the others. The pg_resetxlog 
command runs in about a second or two. Then we attempted to start 
Postgres -- and it started up fine -- whew!
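
As a sanity check, the flag values map straight back to the dumped
checkpoint record with a bit of hex arithmetic (this is just my own
after-the-fact verification, not part of the repair):

```shell
# Cross-check the pg_resetxlog flags against the checkpoint record that
# Tom's tool dumped: nextxid 3904963, nextoid 785429799, last WAL
# file 000001650000004E.
printf -- '-x 0x%x\n' 3904963                  # the dumped nextxid, in hex
printf -- '-l 0x%x,0x%x\n' 357 $((0x4E + 1))   # one segment past the last WAL file

# -o 786000000 was simply padded above the dumped nextoid:
[ 786000000 -gt 785429799 ] && echo 'OID padding OK'
```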

Here's what the resulting pg_control file looked like:

# pg_controldata /replica/pgdata
pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               in production
pg_control last modified:             Sat Dec 18 13:18:08 2004
Current log file ID:                  369
Next log file segment:                94
Latest checkpoint location:           171/5D48F21C
Prior checkpoint location:            171/5AECF030
Latest checkpoint's REDO location:    171/5D48F21C
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        14
Latest checkpoint's NextXID:          3955133
Latest checkpoint's NextOID:          797476992
Time of latest checkpoint:            Sat Dec 18 13:18:06 2004
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

We then spent most of the next 24 hours reviewing the recovered
database. The bulk data loading process was well instrumented, so we
knew exactly which data should have been committed prior to the server
hang, and which files were in progress (we had been doing 10 loads in
parallel) at the time of the hang. The investigation indicated complete
recovery, with no missing or unwanted (i.e. uncommitted records looking
committed) data.


-------------------
Corrective Actions:
-------------------
As mentioned earlier, one action was to increase our swap partition to
10 GB, based on Novell's advice. We also updated to the latest SuSE 9
RPMs and IBM firmware. I believe we were already running the latest
Broadcom (network card) drivers, but I'm sure we upgraded those too if
we weren't. Hopefully all of this addresses whatever caused the server
hang in the first place; we won't know for sure until many months go by
without incident.

The second change was to run "chkconfig postgresql off", which sets
Postgres to *not* automatically restart on a server restart.
Our DBA pointed out that we do not allow Oracle to restart automatically 
precisely because it is on NFS mounted storage. A manual restart 
potentially requires a warm body in the middle of the night, but it 
ensures that we can check that the server and the NFS mount are stable 
before starting the database.
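
A hypothetical sketch of the kind of check that manual start implies
(the function and probe-file names are mine; the post only says we
verify the server and NFS mount are stable first):

```shell
# check_pgdata_mount: succeed only if the mountpoint shows up in
# /proc/mounts as an nfs mount (the pattern also matches nfs4) and the
# data directory is actually writable through it.
check_pgdata_mount() {
    mnt=$1; pgdata=$2
    grep -q " $mnt nfs" /proc/mounts || return 1
    touch "$pgdata/.startup_probe" 2>/dev/null || return 1
    rm -f "$pgdata/.startup_probe"
}

# e.g.:  check_pgdata_mount /replica /replica/pgdata \
#            && /etc/init.d/postgresql start \
#            || echo 'refusing to start: /replica not healthy' >&2
```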

The third change we decided to make was to reduce the number of
parallel bulk data loads. We have 8 CPUs, so something less than 8 is
probably a good number to use. Of course, I just now looked, and it
appears we have resumed loading with 10 processes, so I guess I'll have
to follow up on that next Monday ;-).

Finally, we are running hyperthreaded, and I know others have reported 
that performance is actually better when hyperthreading is disabled. No 
firm decision has been made on that, but I also don't believe it is 
related to the problem we experienced. Of course, I could be wrong.

-------------------------------

That's it -- any comments, advice, or thoughts are welcome.

Joe



