BUG #2023: Assertion Failure: File: "slru.c", Line: 309 - Mailing list pgsql-bugs

From: Joel Stevenson
Subject: BUG #2023: Assertion Failure: File: "slru.c", Line: 309
Date:
Msg-id: 20051107144059.163A9F0FD2@svr2.postgresql.org
List: pgsql-bugs

The following bug has been logged online:

Bug reference:      2023
Logged by:          Joel Stevenson
Email address:      joelstevenson@mac.com
PostgreSQL version: 8.1RC1
Operating system:   RHEL 3 update 6
Description:        Assertion Failure: File: "slru.c", Line: 309
Details:

Hi,

I'm running 8.1RC1 on a RHEL 3 machine (dual-proc Xeon w/ 2 GB of RAM).  Under
extremely heavy load (for the hardware) Postgres is periodically throwing
the following assertion error:

TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
Line: 309)
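
As far as I can tell from the message text alone, the check that fails
amounts to the sketch below.  It is only meant to make the condition
concrete: the struct layout, NUM_SLOTS and EXAMPLE_PAGE_OTHER are my own
placeholders and not the real slru.c definitions, and the standard assert()
stands in for whatever macro the server uses.  I assume the leading "!" in
the TRAP line comes from how that macro reports the condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: NUM_SLOTS, the struct layout and EXAMPLE_PAGE_OTHER are
 * assumptions.  Only page_number, page_status and SLRU_PAGE_READ_IN_PROGRESS
 * are names taken from the assertion text. */
#define NUM_SLOTS 8

typedef enum {
    EXAMPLE_PAGE_OTHER,             /* placeholder for every other state */
    SLRU_PAGE_READ_IN_PROGRESS
} SketchPageStatus;

typedef struct {
    int              page_number[NUM_SLOTS];
    SketchPageStatus page_status[NUM_SLOTS];
} SketchShared;

/* True when the slot still holds the requested page and is still marked
 * as being read in. */
bool slot_matches_read(const SketchShared *shared, int slotno, int pageno)
{
    return shared->page_number[slotno] == pageno &&
           shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS;
}

/* Stand-in for the failing check, using the standard assert() macro. */
void check_after_read(const SketchShared *shared, int slotno, int pageno)
{
    assert(slot_matches_read(shared, slotno, pageno));
}
```

Read this way, the failure means that by the time the check ran, the slot
either held a different page or was no longer marked as read-in-progress.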

"Extremely heavy load" for this machine means that basically PG is operating
at its max_clients threshold, serving Apache web clients.  Each client is
accessing the same tables, which are set up as a work queue.  top shows that
prior to the assertion failures the load average on the machine spikes to
somewhere between 60ish and 115ish, the postgres processes each seem to be
consuming 1.1 - 2.2% of CPU, and both CPUs (all four including I guess the
coprocessors as shown by top) are maxed at basically 100%.  The machine is
not swapping and maintains a small but comfortable amount of free RAM during
these peaks.

There is periodically very high contention between the web clients as shown
by a large number of ungranted locks on the same tuple; I have strategies in
place to try to keep this contention under control but the nature of this
work queue is such that times of high contention are expected.  The
contention is caused by the clients performing SELECT ... FOR UPDATE on the
same table on recently inserted records which are not yet claimed.

The assertion failure has occurred 3 times in the last 2 hours and on the
third occurrence I was able to get a core dump saved during the crash.  I'm
not a C programmer and so don't know exactly what might be useful (or how to
get it) from the core file but 'bt' gives:

#0  0x005c3eff in ?? ()
#1  0x00609c8d in ?? ()
#2  0x006d1cd8 in ?? ()
#3  0x00000000 in ?? ()

If further information from the core file is needed, please let me know and
I'll try to get it (suggestions on how to get the info via gdb are also most
welcome :) ).


Postgres was configured using both --enable-debug and --enable-cassert.
Full config options were:

./configure CFLAGS="-O2 -pipe" --with-perl --with-openssl
--enable-thread-safety --enable-debug --enable-cassert
--with-includes=/usr/kerberos/include

Some non-default postgresql.conf params:
max_connections = 150
ssl = on
shared_buffers = 4000
work_mem = 102400
maintenance_work_mem = 131072
max_stack_depth = 4096
commit_delay = 100
checkpoint_segments = 5
effective_cache_size = 173015
stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = on
autovacuum = on
autovacuum_analyze_scale_factor = 0.1


Log lines:
2005-11-07 05:52:36 PST 23611 LOG:  08P01: unexpected EOF on client
connection
2005-11-07 05:52:36 PST 23611 LOCATION:  SocketBackend, postgres.c:292
2005-11-07 05:52:36 PST 23611 LOG:  00000: disconnection: session time:
0:07:09.17 user=joels database=joels host=[local]
2005-11-07 05:52:36 PST 23611 LOCATION:  log_disconnections,
postgres.c:3538
TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
Line: 309)
2005-11-07 05:53:17 PST 23426 LOG:  00000: server process (PID 23577) was
terminated by signal 6
2005-11-07 05:53:17 PST 23426 LOCATION:  LogChildExit, postmaster.c:2426
2005-11-07 05:53:17 PST 23426 LOG:  00000: terminating any other active
server processes
2005-11-07 05:53:17 PST 23426 LOCATION:  HandleChildCrash,
postmaster.c:2307
2005-11-07 05:53:17 PST 23789 WARNING:  57P02: terminating connection
because of crash of another server process
2005-11-07 05:53:17 PST 23789 DETAIL:  The postmaster has commanded this
server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
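
For reference, the "terminated by signal 6" line is what I'd expect if the
failed assertion ends in abort(), since abort() raises SIGABRT and SIGABRT
is numbered 6 on Linux.  The small sketch below shows that relationship in
isolation; signal_of_aborting_child is a hypothetical helper name of mine
and has nothing to do with the PostgreSQL source.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that calls abort() and returns the number of the signal
 * that terminated it, or -1 if the child could not be started or did not
 * die from a signal. */
int signal_of_aborting_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: suppress the core file for this demo, then abort. */
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);
        abort();
    }

    /* Parent: collect the child's exit status and decode it. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

A return value equal to SIGABRT from this helper corresponds to the
"signal 6" reported by LogChildExit above.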

Thanks,
Joel
