Resent, RE: newbie admin question: postgresql server cras hes - Mailing list pgsql-admin
From | Murthy Kambhampaty |
---|---|
Subject | Resent, RE: newbie admin question: postgresql server cras hes |
Date | |
Msg-id | E631530D51ABD411B823009027855C5B02798F@THOR Whole thread Raw |
List | pgsql-admin |
My earlier reply went to Tom Lane directly, rather than the list. Sorry. Tom, thanks for the reply. I've attached the server log at the first "crash", by which I meant that the server would shutdown with an "open of .../pg_clog/nnnn failed" and restart (it's zipped becuase it's so long: the server would restart and the script that "crashed" it would pickup and try to access the corrupted table and it would crash again, till I stopped the server). In the example below and in subsequent "crashes" -- server restarts -- until I changed the "checkpoint_segments" value to 16 then 64 to increase the checkpoint interval back to 5 minutes, the server restart seems to follow a series of "recycled transaction log file ..." events. The table being processed when the shutdown occurred was invariably corrupted (any attempt to acces it, including a vacuum on it, would result in the same "open of .../pg_clog/nnnn failed" message and a restart). Having tuned those parameters, I expect the server to "run forever", but we got the "signal 11" shutdowns we are now experiencing. In further investigation, I found: 1. I found a website on the "sig 11 problem" which also implicated the hardware. However, this hardware has been running for over a year, with a lot of heavy database use (on MySQL) and several kernel compiles with no problems that would implicate the hardware (users, on the other hand, will be users, myself included). Also, the signal 11 FAQ says the error is most likely to be triggered during kernel compiles, which I've done a fair amount of (updating XFS and adding networking hardware). We did have two processor with stepping 02 and two with stepping 03 ;(, but we're down to two with stepping 02 now, and the problem recurred last night, after the change (I can see if the problem occurs with the other two processors, but ...). 2. At the Linux Database How-To, the section on postgresql (http://homepages.gold.ac.uk/marian-ursu/manuals/howto-linux/Database-HOWTO- 34.html) says: "34.35 How do I tune the database engine for better performance? There are two things that can be done. You can use Openlink's option to disable fsync() by starting the postmaster with a '-o -F' option. This will prevent fsync()'s from flushing to disk after every transaction. You can also use the postmaster -B option to increase the number of shared memory buffers shared among the backend processes. If you make this parameter too high, the process will not start or crash unexpectedly. Each buffer is 8K and the defualt is 64 buffers." Well, we've got 4GB of RAM and: shared_buffers=8192 (128MB) sort_mem=131072 (128MB) vacuum_mem = 81920 (80MB) wal_files = 64 checkpoint_segments = 64 (1G) so, that's 1.25GB max (since we install pgsql it's been used only by a single user, except for the occasional "select * from <tbl_nm> limit 100" to make sure things look right, but the restarts happen regardless) unless we're also doing a vacuum, in which case its just over 1.33G, so we're safe, no? As to producing a stack backtrace, the postmaster restarts after the problem ...? Thanks to everybody for the help, and I do hope we can solve this and we can switch to postgresql for good. (We really like the SQL conformance, since we also use SQL in a product called SAS, educating users on how to get around MySQL's peculiar choices on SQL implementation doesn't work; we think the GEQO will help us out with our ad-hoc queries which tend to involve conditional joins on mutliple tables; being able to run pgdump while queries are running overnight is also a critical advantage. And one of these days the R-language may be one of the languages supported by postgresql and we can do much of our statistical analysis within postgresql!) Murthy >-----Original Message----- >From: Tom Lane [mailto:tgl@sss.pgh.pa.us] >Sent: Friday, August 02, 2002 00:36 >To: Murthy Kambhampaty >Cc: 'pgsql-admin@postgresql.org' >Subject: Re: [ADMIN] newbie admin question: postgresql server crashes > > >Murthy Kambhampaty <Murthy.Kambhampaty@goeci.com> writes: >> Our postgresql server crashed this morning unexpectedly. >Attached is our >> server log from before the crash. What's perplexing is the sequence >> "2002-08-01 10:17:30 DEBUG: server process (pid 13715) was >terminated by >> signal 11 > >Core file? Stack backtrace? > >> 2002-08-01 10:17:30 DEBUG: database system was interrupted >at 2002-08-01 >> 05:30:41 EDT" > >> But there is nor report of any problems at 05:30:41 EDT. > >I believe that's effectively the time of the last checkpoint. If the >system was idle between 5:30 and 10:15 or so then there's nothing very >surprising there. Tell us about the core dump. > >> I should note that we deployed this server last week with >> checkpoint_segments=3 and fsync off. The server crashed >repeatedly on file >> imports when the checkpoing interval dropped below 5 secs. > >Define "crash". Again, stack backtraces would be useful. > >FWIW, a lot of the reports of instability we've seen recently have >traced to hardware problems. How confident are you of this hardware? > > regards, tom lane >
Attachment
pgsql-admin by date: