BUG #2023: Assertion Failure: File: "slru.c", Line: 309 - Mailing list pgsql-bugs
From | Joel Stevenson
---|---
Subject | BUG #2023: Assertion Failure: File: "slru.c", Line: 309
Date |
Msg-id | 20051107144059.163A9F0FD2@svr2.postgresql.org
Responses | Re: BUG #2023: Assertion Failure: File: "slru.c", Line: 309
List | pgsql-bugs
The following bug has been logged online:

Bug reference:      2023
Logged by:          Joel Stevenson
Email address:      joelstevenson@mac.com
PostgreSQL version: 8.1RC1
Operating system:   RHEL 3 update 6
Description:        Assertion Failure: File: "slru.c", Line: 309

Details:

Hi,

I'm running 8.1RC1 on a RHEL 3 machine (dual-processor Xeon with 2 GB of RAM). Under extremely heavy load (for the hardware), Postgres periodically throws the following assertion error:

TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c", Line: 309)

"Extremely heavy load" for this machine means that PG is basically operating at its max_clients threshold, serving Apache web clients. Each client accesses the same tables, which are set up as a work queue. top shows that just before the assertion failures the load average spikes to roughly 60 through 115, each postgres process consumes about 1.1 - 2.2% of CPU, and both CPUs (all four logical processors shown by top, counting hyperthreading) are pegged at essentially 100%. The machine is not swapping and maintains a small but comfortable amount of free RAM during these peaks.

There is periodically very high contention between the web clients, visible as a large number of ungranted locks on the same tuple. I have strategies in place to keep this contention under control, but the nature of this work queue is such that times of high contention are expected. The contention is caused by the clients performing SELECT ... FOR UPDATE on the same table, targeting recently inserted records that are not yet claimed.

The assertion failure has occurred three times in the last two hours, and on the third occurrence I was able to get a core dump saved during the crash. I'm not a C programmer, so I don't know exactly what might be useful (or how to get it) from the core file, but 'bt' gives:

#0  0x005c3eff in ?? ()
#1  0x00609c8d in ?? ()
#2  0x006d1cd8 in ?? ()
#3  0x00000000 in ?? ()

If further information from the core file is needed, please let me know and I'll try to get it (suggestions on how to get the info via gdb are also most welcome :) ).

Postgres was configured using both --enable-debug and --enable-cassert. The full configure options were:

./configure CFLAGS=-O2 -pipe --with-perl --with-openssl --enable-thread-safety --enable-debug --enable-cassert --with-includes=/usr/kerberos/include

Some non-default postgresql.conf parameters:

max_connections = 150
ssl = on
shared_buffers = 4000
work_mem = 102400
maintenance_work_mem = 131072
max_stack_depth = 4096
commit_delay = 100
checkpoint_segments = 5
effective_cache_size = 173015
stats_start_collector = on
stats_command_string = on
stats_block_level = on
stats_row_level = on
stats_reset_on_server_start = on
autovacuum = on
autovacuum_analyze_scale_factor = 0.1

Log lines:

2005-11-07 05:52:36 PST 23611 LOG: 08P01: unexpected EOF on client connection
2005-11-07 05:52:36 PST 23611 LOCATION: SocketBackend, postgres.c:292
2005-11-07 05:52:36 PST 23611 LOG: 00000: disconnection: session time: 0:07:09.17 user=joels database=joels host=[local]
2005-11-07 05:52:36 PST 23611 LOCATION: log_disconnections, postgres.c:3538
TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c", Line: 309)
2005-11-07 05:53:17 PST 23426 LOG: 00000: server process (PID 23577) was terminated by signal 6
2005-11-07 05:53:17 PST 23426 LOCATION: LogChildExit, postmaster.c:2426
2005-11-07 05:53:17 PST 23426 LOG: 00000: terminating any other active server processes
2005-11-07 05:53:17 PST 23426 LOCATION: HandleChildCrash, postmaster.c:2307
2005-11-07 05:53:17 PST 23789 WARNING: 57P02: terminating connection because of crash of another server process
2005-11-07 05:53:17 PST 23789 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

Thanks,
Joel
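[Editor's note: the anonymous "?? ()" frames in the backtrace above usually mean gdb was run against the core file alone. Since this server was built with --enable-debug, a symbol-resolved backtrace can typically be recovered by pointing gdb at both the postgres executable and the core. A sketch only; the binary path and core file name below are illustrative, not taken from this report:

    # Substitute your actual postgres binary and core file paths.
    gdb /usr/local/pgsql/bin/postgres /path/to/core

    # Then, at the (gdb) prompt:
    (gdb) bt full      # backtrace with local variables for each frame
    (gdb) quit

With symbols loaded from the binary, the "?? ()" frames should resolve to function names, which is what the developers need to locate the failing code path near slru.c:309.]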