Resent, RE: newbie admin question: postgresql server cras hes - Mailing list pgsql-admin

From Murthy Kambhampaty
Subject Resent, RE: newbie admin question: postgresql server cras hes
Date
Msg-id E631530D51ABD411B823009027855C5B02798F@THOR
Whole thread Raw
List pgsql-admin
My earlier reply went to Tom Lane directly, rather than the list. Sorry.



Tom, thanks for the reply. I've attached the server log at the first
"crash", by which I meant that the server would shutdown with an "open of
.../pg_clog/nnnn failed" and restart (it's zipped becuase it's so long: the
server would restart and the script that "crashed" it would pickup and try
to access the corrupted table and it would crash again, till I stopped the
server). In the example below and in subsequent "crashes" -- server restarts
-- until I changed the "checkpoint_segments" value to 16 then 64 to increase
the checkpoint interval back to 5 minutes, the server restart seems to
follow a series of "recycled transaction log file ..." events. The table
being processed when the shutdown occurred was invariably corrupted (any
attempt to acces it, including a vacuum on it, would result in the same
"open of .../pg_clog/nnnn failed" message and a restart). Having tuned those
parameters, I expect the server to "run forever", but we got the "signal 11"
shutdowns we are now experiencing.

In further investigation, I found:
1. I found a website on the "sig 11 problem" which also implicated the
hardware. However, this hardware has been running for over a year, with a
lot of heavy database use (on MySQL) and several kernel compiles with no
problems that would implicate the hardware (users, on the other hand, will
be users, myself included). Also, the signal 11 FAQ says the error is most
likely to be triggered during kernel compiles, which I've done a fair amount
of (updating XFS and adding networking hardware). We did have two processor
with stepping 02 and two with stepping 03 ;(, but we're down to two with
stepping 02 now, and the problem recurred last night, after the change (I
can see if the problem occurs with the other two processors, but ...).

2. At the Linux Database How-To, the section on postgresql
(http://homepages.gold.ac.uk/marian-ursu/manuals/howto-linux/Database-HOWTO-
34.html) says:
"34.35 How do I tune the database engine for better performance?
There are two things that can be done. You can use Openlink's option to
disable fsync() by starting the postmaster with a '-o -F' option. This will
prevent fsync()'s from flushing to disk after every transaction.
You can also use the postmaster -B option to increase the number of shared
memory buffers shared among the backend processes. If you make this
parameter too high, the process will not start or crash unexpectedly. Each
buffer is 8K and the defualt is 64 buffers."

Well, we've got 4GB of RAM and:
shared_buffers=8192 (128MB)
sort_mem=131072 (128MB)
vacuum_mem = 81920 (80MB)
wal_files = 64
checkpoint_segments = 64 (1G)

so, that's 1.25GB max (since we install pgsql it's been used only by a
single user, except for the occasional "select * from <tbl_nm> limit 100" to
make sure things look right, but the restarts happen regardless) unless
we're also doing a vacuum, in which case its just over 1.33G, so we're safe,
no?

As to producing a stack backtrace, the postmaster restarts after the problem
...?

Thanks to everybody for the help, and I do hope we can solve this and we can
switch to postgresql for good. (We really like the SQL conformance, since we
also use SQL in a product called SAS, educating users on how to get around
MySQL's peculiar choices on SQL implementation doesn't work; we think the
GEQO will help us out with our  ad-hoc queries which tend to involve
conditional joins on mutliple tables; being able to run pgdump while queries
are running overnight is also a critical advantage. And one of these days
the R-language may be one of the languages supported by postgresql and we
can do much of our statistical analysis within postgresql!)


    Murthy




>-----Original Message-----
>From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>Sent: Friday, August 02, 2002 00:36
>To: Murthy Kambhampaty
>Cc: 'pgsql-admin@postgresql.org'
>Subject: Re: [ADMIN] newbie admin question: postgresql server crashes
>
>
>Murthy Kambhampaty <Murthy.Kambhampaty@goeci.com> writes:
>> Our postgresql server crashed this morning unexpectedly.
>Attached is our
>> server log from before the crash. What's perplexing is the sequence
>> "2002-08-01 10:17:30 DEBUG:  server process (pid 13715) was
>terminated by
>> signal 11
>
>Core file?  Stack backtrace?
>
>> 2002-08-01 10:17:30 DEBUG:  database system was interrupted
>at 2002-08-01
>> 05:30:41 EDT"
>
>> But there is nor report of any problems at 05:30:41 EDT.
>
>I believe that's effectively the time of the last checkpoint.  If the
>system was idle between 5:30 and 10:15 or so then there's nothing very
>surprising there.  Tell us about the core dump.
>
>> I should note that we deployed this server last week with
>> checkpoint_segments=3 and fsync off. The server crashed
>repeatedly on file
>> imports when the checkpoing interval dropped below 5 secs.
>
>Define "crash".  Again, stack backtraces would be useful.
>
>FWIW, a lot of the reports of instability we've seen recently have
>traced to hardware problems.  How confident are you of this hardware?
>
>            regards, tom lane
>


Attachment

pgsql-admin by date:

Previous
From: Tim Ellis
Date:
Subject: Re: pgsql for Solaris 8 (SPARC)
Next
From: Steve Doerr
Date:
Subject: Restore question