Thread: BUG #2033: Assertion Failure: File: "procarray.c", Line: 492
The following bug has been logged online: Bug reference: 2033 Logged by: Joel Stevenson Email address: joelstevenson@mac.com PostgreSQL version: 8.1.0 Operating system: RHEL 3 update 6 Description: Assertion Failure: File: "procarray.c", Line: 492 Details: Hi, I'm running into an Assertion failure this morning w/8.1.0. I believe it is related to using the NOWAIT flag. Here is the log message: TRAP: FailedAssertion("!(serializable ? !((MyProc->xmin) != ((TransactionId) 0)) : ((MyProc->xmin) != ((TransactionId) 0)))", File: "procarray.c", Line: 492) Postgres was configured using both --enable-debug and --enable-cassert. Full config options were: ./configure CFLAGS=-O2 -pipe --with-perl --with-openssl --enable-thread-safety --enable-debug --enable-cassert --with-includes=/usr/kerberos/include Some non-default postgresql.conf params: max_connections = 150 ssl = on shared_buffers = 4000 work_mem = 102400 maintenance_work_mem = 131072 max_stack_depth = 4096 commit_delay = 100 checkpoint_segments = 5 effective_cache_size = 173015 stats_start_collector = on stats_command_string = on stats_block_level = on stats_row_level = on stats_reset_on_server_start = on autovacuum = on autovacuum_analyze_scale_factor = 0.1 I've removed the 'NOWAIT' for the time being, but thought I should mention the issue. Thanks, Joel
"Joel Stevenson" <joelstevenson@mac.com> writes: > I'm running into an Assertion failure this morning w/8.1.0. I believe it is > related to using the NOWAIT flag. Here is the log message: > TRAP: FailedAssertion("!(serializable ? !((MyProc->xmin) != ((TransactionId) > 0)) > : ((MyProc->xmin) != ((TransactionId) 0)))", File: "procarray.c", Line: > 492) With no stack trace, and no information about how to reproduce the error, I'm afraid this report is just about useless :-( regards, tom lane
Sorry, I realize this bug is a little lean on details. PG is set up to serve multiple concurrent processes in a work queue fashion, therefore these processes are in contention for the same resources as the queue is basically a FIFO and I'm using SELECT ... FOR UPDATE NOWAIT in each client to grab the next work segment in line. I've been running the setup on 8.1b3 and 8.1RC1 without this particular Assertion failure, but had not been using the NOWAIT flag until today. During a period of perhaps 20 processes operating on the queue it looks like postgres failed this assertion 17 times, the first coming very shortly after the processes began and the rest following in quick succession. I do have the core files if further info is needed from them. Thanks, Joel At 3:19 PM +0000 11/9/05, Joel Stevenson wrote: >The following bug has been logged online: > >Bug reference: 2033 >Logged by: Joel Stevenson >Email address: joelstevenson@mac.com >PostgreSQL version: 8.1.0 >Operating system: RHEL 3 update 6 >Description: Assertion Failure: File: "procarray.c", Line: 492 >Details: > >Hi, > >I'm running into an Assertion failure this morning w/8.1.0. I believe it is >related to using the NOWAIT flag. Here is the log message: > >TRAP: FailedAssertion("!(serializable ? !((MyProc->xmin) != ((TransactionId) >0)) > : ((MyProc->xmin) != ((TransactionId) 0)))", File: "procarray.c", Line: >492) > > >Postgres was configured using both --enable-debug and --enable-cassert. >Full config options were: > >./configure CFLAGS=-O2 -pipe --with-perl --with-openssl >--enable-thread-safety --enable-debug --enable-cassert >--with-includes=/usr/kerberos/include > >Some non-default postgresql.conf params: >max_connections = 150 >ssl = on >shared_buffers = 4000 >work_mem = 102400 >maintenance_work_mem = 131072 >max_stack_depth = 4096 >commit_delay = 100 >checkpoint_segments = 5 >effective_cache_size = 173015 >stats_start_collector = on >stats_command_string = on >stats_block_level = on >stats_row_level = on >stats_reset_on_server_start = on >autovacuum = on >autovacuum_analyze_scale_factor = 0.1 > >I've removed the 'NOWAIT' for the time being, but thought I should mention >the issue. > >Thanks, >Joel > >---------------------------(end of broadcast)--------------------------- >TIP 4: Have you searched our list archives? > > http://archives.postgresql.org
A backtrace from one of the cores would be a good start. On Wed, Nov 09, 2005 at 07:44:01AM -0800, Joel Stevenson wrote: > Sorry, I realize this bug is a little lean on details. > > PG is set up to serve multiple concurrent processes in a work queue > fashion, therefore these processes are in contention for the same > resources as the queue is basically a FIFO and I'm using SELECT ... > FOR UPDATE NOWAIT in each client to grab the next work segment in > line. > > I've been running the setup on 8.1b3 and 8.1RC1 without this > particular Assertion failure, but had not been using the NOWAIT flag > until today. During a period of perhaps 20 processes operating on > the queue it looks like postgres failed this assertion 17 times, the > first coming very shortly after the processes began and the rest > following in quick succession. > > I do have the core files if further info is needed from them. > > Thanks, > Joel > > > At 3:19 PM +0000 11/9/05, Joel Stevenson wrote: > >The following bug has been logged online: > > > >Bug reference: 2033 > >Logged by: Joel Stevenson > >Email address: joelstevenson@mac.com > >PostgreSQL version: 8.1.0 > >Operating system: RHEL 3 update 6 > >Description: Assertion Failure: File: "procarray.c", Line: 492 > >Details: > > > >Hi, > > > >I'm running into an Assertion failure this morning w/8.1.0. I believe it > >is > >related to using the NOWAIT flag. Here is the log message: > > > >TRAP: FailedAssertion("!(serializable ? !((MyProc->xmin) != > >((TransactionId) > >0)) > > : ((MyProc->xmin) != ((TransactionId) 0)))", File: "procarray.c", Line: > >492) > > > > > >Postgres was configured using both --enable-debug and --enable-cassert. > >Full config options were: > > > >./configure CFLAGS=-O2 -pipe --with-perl --with-openssl > >--enable-thread-safety --enable-debug --enable-cassert > >--with-includes=/usr/kerberos/include > > > >Some non-default postgresql.conf params: > >max_connections = 150 > >ssl = on > >shared_buffers = 4000 > >work_mem = 102400 > >maintenance_work_mem = 131072 > >max_stack_depth = 4096 > >commit_delay = 100 > >checkpoint_segments = 5 > >effective_cache_size = 173015 > >stats_start_collector = on > >stats_command_string = on > >stats_block_level = on > >stats_row_level = on > >stats_reset_on_server_start = on > >autovacuum = on > >autovacuum_analyze_scale_factor = 0.1 > > > >I've removed the 'NOWAIT' for the time being, but thought I should mention > >the issue. > > > >Thanks, > >Joel > > > >---------------------------(end of broadcast)--------------------------- > >TIP 4: Have you searched our list archives? > > > > http://archives.postgresql.org > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
At 11:42 AM -0500 11/9/05, Tom Lane wrote: >Joel Stevenson <joelstevenson@mac.com> writes: >> gdb's 'bt' on one of the core files produces: > >> #0 0x00138eff in ?? () >> #1 0x0017ec8d in ?? () >> #2 0x00246cd8 in ?? () >> #3 0x00000000 in ?? () > >This looks like you have a "stripped" executable, which is a bit odd >considering you said that you built with --enable-debug. Please >check that the installed executable is what you think it is ... Hmmm, I've moved all previous (beta, RC) PG base directories and re-installed 8.1.0 again with --enable-debug and --enable-cassert. During the make process I saw the cc lines floating past with the -g option. The commands were: $ ./configure --with-perl --with-openssl --enable-thread-safety --enable-debug --enable-cassert --with-includes=/usr/kerberos/include $ make -j4 $ sudo make install $ cp /usr/local/pgsql.old/data /usr/local/pgsql $ sudo su postgres $ /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data start $ pkill -ABRT postmaster This produced a bunch of core files that show the following backtrace: #0 0x001ea038 in ?? () #1 0xbfffa4d8 in ?? () #2 0xbfffa5e0 in ?? () #3 0xbfffa560 in ?? () #4 0x08180844 in ?? () #5 0x00000005 in ?? () #6 0xbfffa4e0 in ?? () #7 0x00000000 in ?? () I'm at a loss as to what to do about it, really; is there a hidden configure flag or something that could be in my environment that's causing the executable to be stripped? I see various strip related things within the Makefile, but I definitely didn't make the install-strip target. >Is there any chance of putting together a self-contained test case >that other people could try? The simple test case so far is not causing the assertion failure, perhaps because it is just a simple approximation of the case. At this point I'm more concerned about getting a non-stripped PG installed so that should this error or another occur the core file will have perhaps helpful information. Thanks in advance for any pointers, Joel
Joel Stevenson <joelstevenson@mac.com> writes: > This produced a bunch of core files that show the following backtrace: > #0 0x001ea038 in ?? () > #1 0xbfffa4d8 in ?? () > #2 0xbfffa5e0 in ?? () > #3 0xbfffa560 in ?? () > #4 0x08180844 in ?? () > #5 0x00000005 in ?? () > #6 0xbfffa4e0 in ?? () > #7 0x00000000 in ?? () > I'm at a loss as to what to do about it, really; is there a hidden > configure flag or something that could be in my environment that's > causing the executable to be stripped? Nondefault LDFLAGS maybe? Look at the commands that link all the SUBSYS.o files to produce the postgres executable. Also, on most Unixen "file /path/to/postgres" will tell the difference between a stripped and nonstripped executable. If "file" says that the executable isn't stripped, then the problem is either with gdb itself or with the way you're invoking it. Could we see a cut-and-paste of the whole terminal session with gdb, rather than just the result of the bt part? regards, tom lane
Thanks Tom, and apologies for the confusion; not having worked w/gdb before I was invoking it incorrectly. Here's what the core file produced from the assertion failure has for a backtrace: $ gdb ./bin/postgres data/core.20633 (gdb) bt #0 0x00138eff in raise () from /lib/tls/libc.so.6 #1 0x0013a705 in abort () from /lib/tls/libc.so.6 #2 0x0820dc66 in ExceptionalCondition ( conditionName=0x6 <Address 0x6 out of bounds>, errorType=0x5099 <Address 0x5099 out of bounds>, fileName=0x0, lineNumber=492) at assert.c:51 #3 0x081987bd in GetSnapshotData (snapshot=0x8303f04, serializable=0 '\0') at procarray.c:514 #4 0x082293df in GetTransactionSnapshot () at tqual.c:1265 #5 0x081a8eb8 in PortalStart (portal=0x9b09a24, params=0x9b1c5dc, snapshot=0x0) at pquery.c:375 #6 0x081a6286 in exec_bind_message (input_message=0xbfffa2e0) at postgres.c:1643 #7 0x081a7fb2 in PostgresMain (argc=4, argv=0x9aaa50c, username=0x9aaa4e4 "joels") at postgres.c:3205 #8 0x08182a45 in BackendRun (port=0x9ac8d58) at postmaster.c:2854 #9 0x08182529 in BackendStartup (port=0x9ac8d58) at postmaster.c:2498 #10 0x08180a57 in ServerLoop () at postmaster.c:1231 #11 0x0817fee5 in PostmasterMain (argc=3, argv=0x9aa8478) at postmaster.c:943 #12 0x08147a37 in main (argc=3, argv=0x9aa8478) at main.c:256 -Joel
Joel Stevenson <joelstevenson@mac.com> writes: > Thanks Tom, and apologies for the confusion; not having worked w/gdb > before I was invoking it incorrectly. Here's what the core file > produced from the assertion failure has for a backtrace: OK, now that that's straightened out, there are some more things to look at in the core file. Please try these gdb commands: p *MyProc p *SerializableSnapshot p *CurrentTransactionState regards, tom lane
At 3:24 PM -0500 11/9/05, Tom Lane wrote: >Joel Stevenson <joelstevenson@mac.com> writes: >> Thanks Tom, and apologies for the confusion; not having worked w/gdb >> before I was invoking it incorrectly. Here's what the core file >> produced from the assertion failure has for a backtrace: > >OK, now that that's straightened out, there are some more things to >look at in the core file. Please try these gdb commands: > > p *MyProc > p *SerializableSnapshot > p *CurrentTransactionState > > regards, tom lane Ok, here are the results: (gdb) p *MyProc $1 = {links = {prev = 4294967295, next = 4294967295}, sem = {semId = 8880136, semNum = 10}, waitStatus = 0, xid = 0, xmin = 0, pid = 20633, databaseId = 16385, roleId = 16384, lwWaiting = 0 '\0', lwExclusive = 1 '\001', lwWaitLink = 0x0, waitLock = 0x0, waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, procLocks = {prev = 34618656, next = 34618656}, subxids = { overflowed = 0 '\0', nxids = 0, xids = {0 <repeats 64 times>}}} (gdb) p *SerializableSnapshot $2 = {xmin = 1068152, xmax = 1068155, xcnt = 2, xip = 0x9b142b8, curcid = 0} (gdb) p *CurrentTransactionState $3 = {transactionId = 1068154, subTransactionId = 1, name = 0x0, savepointLevel = 0, state = TRANS_ABORT, blockState = TBLOCK_ABORT, nestingLevel = 1, curTransactionContext = 0x9b06a9c, curTransactionOwner = 0x9ab0294, childXids = 0x0, currentUser = 0, prevXactReadOnly = 0 '\0', parent = 0x0} -Joel
Joel Stevenson <joelstevenson@mac.com> writes: > (gdb) p *CurrentTransactionState > $3 = {transactionId = 1068154, subTransactionId = 1, name = 0x0, > savepointLevel = 0, > state = TRANS_ABORT, blockState = TBLOCK_ABORT, nestingLevel = 1, > curTransactionContext = 0x9b06a9c, curTransactionOwner = 0x9ab0294, > childXids = 0x0, currentUser = 0, prevXactReadOnly = 0 '\0', parent = 0x0} Ah-hah, that tells the tale: you're in an already-failed transaction, and so there are large parts of the backend that Just Won't Work. What is missing here, I think, is a test in exec_bind_message() similar to what's in exec_execute_message(): bail out early if the transaction is aborted and the command in the portal is anything but ROLLBACK/COMMIT/PREPARE. In this case you are trying to bind to a portal containing a SELECT statement (the line number in pquery.c proves this), and since you couldn't Execute that, there seems little need to let you Bind it. Kinda surprising this hasn't been reported before; the bug or something close to it doubtless exists in 7.4 as well. regards, tom lane
Joel Stevenson <joelstevenson@mac.com> writes: > I've been running the setup on 8.1b3 and 8.1RC1 without this > particular Assertion failure, but had not been using the NOWAIT flag > until today. During a period of perhaps 20 processes operating on > the queue it looks like postgres failed this assertion 17 times, the > first coming very shortly after the processes began and the rest > following in quick succession. BTW, I imagine that the apparent correlation to NOWAIT is occurring because your client-side code was not checking to see if the NOWAIT query had failed before trying to launch another query in the same transaction. regards, tom lane
At 4:09 PM -0500 11/9/05, Tom Lane wrote: >Joel Stevenson <joelstevenson@mac.com> writes: >> I've been running the setup on 8.1b3 and 8.1RC1 without this >> particular Assertion failure, but had not been using the NOWAIT flag >> until today. During a period of perhaps 20 processes operating on >> the queue it looks like postgres failed this assertion 17 times, the >> first coming very shortly after the processes began and the rest >> following in quick succession. > >BTW, I imagine that the apparent correlation to NOWAIT is occurring >because your client-side code was not checking to see if the NOWAIT >query had failed before trying to launch another query in the same >transaction. That is indeed what the code was doing. I'd added NOWAIT to the SELECT statement and was catching the error that was being thrown from the NOWAIT behavior, but didn't think that the error would invalidate the current transaction (which was not a prepared transaction.) The fact that it does kill the transaction makes perfect sense. I'd been thinking of NOWAIT all wrong: as a "return on block", but it's an "error on block". Thanks again, Tom. -Joel