Thread: ERROR: infinite recursion in proc_exit
Hi all! On a number of occasions over the past couple of weeks I have encountered the following error during my nightly vacuum process. It repeats infinitely until outside intervention. It seems to identify the fact that it is stuck in an infinite recursion but unable to deal with it. The relevant portion of the log is at the end of the message. Any hints/suggestions are welcome.

Also, after exiting from the infinite loop I have the following errors appearing during vacuum and vacuum analyze:

NOTICE: CreatePortal: portal <vacuum> already exists
NOTICE: Index pg_attribute_attrelid_index: NUMBER OF INDEX' TUPLES (1203) IS NOT THE SAME AS HEAP' (1193)
NOTICE: Index pg_attribute_relid_attnum_index: NUMBER OF INDEX' TUPLES (1203) IS NOT THE SAME AS HEAP' (1193)
NOTICE: Index pg_attribute_relid_attnam_index: NUMBER OF INDEX' TUPLES (1203) IS NOT THE SAME AS HEAP' (1193)
ERROR: cannot find attribute 1 of relation pg_temp.13894.119

Any suggestions on cleaning this up are appreciated...

- Kristofer

Nov 4 03:37:00 mymailman logger: DEBUG: --Relation tblvalue--
Nov 4 03:37:04 mymailman logger: DEBUG: Pages 945: Changed 83, Reapped 5, Empty 0, New 0; Tup 128483: Vac 13, Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 53, MaxLen 85; Re-using: Free/Avail. Space 876/876; EndEmpty/Avail. Pages 0/5. Elapsed 0/0 sec.
Nov 4 03:37:08 mymailman logger: DEBUG: Index tblvalue_idx2: Pages 432; Tuples 128483: Deleted 13. Elapsed 0/0 sec.
Nov 4 03:37:10 mymailman logger: DEBUG: Index tblvalue_oid: Pages 289; Tuples 128483: Deleted 13. Elapsed 0/1 sec.
Nov 4 03:37:20 mymailman logger: ERROR: cannot find attribute 1 of relation pg_temp.13894.119
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: FATAL 1: Socket command type
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: unknown frontend message was received
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: FATAL 1: Socket command type ? unknown
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: unknown frontend message was received
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: FATAL 1: Socket command type ? unknown
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: unknown frontend message was received
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: FATAL 1: Socket command type ? unknown
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: unknown frontend message was received
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: FATAL 1: Socket command type ? unknown
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: unknown frontend message was received
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: infinite recursion in proc_exit
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: infinite recursion in proc_exit
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: infinite recursion in proc_exit
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
Nov 4 03:37:20 mymailman logger: ERROR: infinite recursion in proc_exit

- K

Kristofer Munn * KMI * 973-509-9414 * AIM KrMunn * ICQ 352499 * www.munn.com
Kristofer Munn <kmunn@munn.com> writes:
> On a number of occasions over the past couple of weeks I have encountered
> the following error during my nightly vacuum process. It repeats
> infinitely until outside intervention. It seems to identify the fact that
> it is stuck in an infinite recursion but unable to deal with it.

What Postgres version are you using?

This loop:

> Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor
> Nov 4 03:37:20 mymailman logger: FATAL 1: Socket command type ? unknown
> Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
> Nov 4 03:37:20 mymailman logger: ERROR: unknown frontend message was received
> Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
> Nov 4 03:37:20 mymailman logger: NOTICE: AbortTransaction and not in in-progress state
> Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
> Nov 4 03:37:20 mymailman logger: pq_flush: send() failed: Bad file descriptor
> Nov 4 03:37:20 mymailman logger: pq_recvbuf: recv() failed: Bad file descriptor

looks like we are trying to read a command from the frontend, failing because the socket descriptor's been clobbered, trying to send the elog() report to the client --- which also fails, of course --- returning to the main loop, and failing again. But as far as I can tell from looking at either 6.5 or current code, that can't happen: a failure return from pq_recvbuf should lead to proc_exit *without* reaching the 'Socket command type ? unknown' elog. So I think you are working with 6.4 or older code, in which case an update would be your best bet.

If you want to try to recover without doing an update, I think you'll still need to do a pg_dump/destroydb/createdb/reload. It looks like the indexes on pg_attribute have been corrupted, and there's not any easier way to clean that up. (If it were a user table, you could just drop and recreate the index, but don't try that on pg_attribute...)

regards, tom lane
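To make the failure cycle concrete, here is a small self-contained C program that simulates the control flow described above. It is only a sketch under assumptions: the elog() and pq_recvbuf() bodies are stand-ins, the messages are abbreviated, and the iteration cap exists only so the demo terminates; none of it is the real postgres.c code. The point is that an elog(ERROR) issued after the socket is gone jumps back to the main loop, which immediately fails the same way again:

    /*
     * Simulation of the failure loop described above; NOT PostgreSQL source.
     * elog(ERROR) is modelled as a longjmp back to the main loop, which is
     * what turns a dead client socket into an endless stream of identical
     * log lines.
     */
    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ERROR 1

    static jmp_buf Warn_restart;        /* where elog(ERROR) lands */
    static int     iterations = 0;      /* demo only: cap the loop */

    static void
    elog(int level, const char *msg)
    {
        (void) level;
        fprintf(stderr, "ERROR: %s\n", msg);
        longjmp(Warn_restart, 1);       /* back to the top of the main loop */
    }

    /* Stand-in for pq_recvbuf(): the descriptor is gone, so it always fails. */
    static int
    pq_recvbuf(void)
    {
        fprintf(stderr, "pq_recvbuf: recv() failed: Bad file descriptor\n");
        return EOF;
    }

    int
    main(void)
    {
        setjmp(Warn_restart);           /* elog(ERROR) re-enters here */

        for (;;)
        {
            if (++iterations > 5)       /* the real loop never stops */
            {
                fprintf(stderr, "(demo cut off after 5 iterations)\n");
                return 1;
            }

            /*
             * The buggy path: on EOF the code should call proc_exit() and
             * die quietly; instead it complains via elog(), which lands
             * right back at the top of this loop.
             */
            if (pq_recvbuf() == EOF)
                elog(ERROR, "unknown frontend message was received");
        }
    }

Run, it prints the same recv()/ERROR pair over and over (stopping only because of the demo cap); with the behavior Tom describes as correct, the EOF branch would call proc_exit() instead of elog(), and the backend would simply exit the first time through.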
> What Postgres version are you using?

My apologies, should have included that with my original request:

[PostgreSQL 6.5.2 on i686-pc-linux-gnu, compiled by gcc egcs-2.91.66]

> If you want to try to recover without doing an update, I think you'll
> still need to do a pg_dump/destroydb/createdb/reload. It looks like
> the indexes on pg_attribute have been corrupted, and there's not any
> easier way to clean that up. (If it were a user table, you could just
> drop and recreate the index, but don't try that on pg_attribute...)

What are the ramifications of continuing with the corrupted indexes - undefined behavior? Filesystems have fsck to fix stuff - are there any tools on the docket to reconstruct the indexes or other recoverable things? These would be useful for systems where the database is 500+ Megs and takes quite a while to reload.

The application involved uses temp files (you see some sort of attribute failure in the VACUUM before everything goes haywire - perhaps unrelated) as well as transactions. I don't create or drop temp tables inside transactions, to avoid the accompanying error messages. A database maintenance program runs nightly to cull old data from the database and then runs a vacuum. During that time, transactions continue unabated.

- K

Kristofer Munn * KMI * 973-509-9414 * AIM KrMunn * ICQ 352499 * www.munn.com
Kristofer Munn <kmunn@munn.com> writes:
>> What Postgres version are you using?

> My apologies, should have included that with my original request:
> [PostgreSQL 6.5.2 on i686-pc-linux-gnu, compiled by gcc egcs-2.91.66]

Hmm. If that trace is from 6.5 code, then postgres.c should certainly be calling proc_exit after the recv() fails. I wonder if proc_exit is returning because proc_exit_inprogress is nonzero? proc_exit's use of elog(ERROR) does look mighty bogus to me --- that path could possibly cause a recursion just like this, but how did the code get into it to begin with?

But that's not very relevant to your real problem, which is that there must be something corrupted in pg_attribute's indexes.

> What are the ramifications of continuing with the corrupted indexes -
> undefined behavior?

I wouldn't recommend it.

> Filesystems have fsck to fix stuff - are there any
> tools on the docket to reconstruct the indexes or other recoverable
> things?

I've thought for some time that vacuum ought to just rebuild the indexes from scratch. That'd provide a recovery path for this sort of problem, and I suspect it'd actually be faster than what vacuum does now. I'm not volunteering to make it happen, though.

regards, tom lane
Tom Lane wrote:
> Kristofer Munn <kmunn@munn.com> writes:
> > [PostgreSQL 6.5.2 on i686-pc-linux-gnu, compiled by gcc egcs-2.91.66]
>
> But that's not very relevant to your real problem, which is that
> there must be something corrupted in pg_attribute's indexes.
>
> > What are the ramifications of continuing with the corrupted indexes -
> > undefined behavior?
>
> I wouldn't recommend it.
>
> > Filesystems have fsck to fix stuff - are there any
> > tools on the docket to reconstruct the indexes or other recoverable
> > things?
>
> I've thought for some time that vacuum ought to just rebuild the indexes
> from scratch. That'd provide a recovery path for this sort of problem,
> and I suspect it'd actually be faster than what vacuum does now. I'm
> not volunteering to make it happen, though.

I don't know if you could drop/rebuild an index on a system catalog while the database is online. But at one time there was a utility called reindexdb. That used the bootstrap processing mode interface of the backend to drop and recreate all system catalog indices. I don't know who removed that and why, nor do I know if it would still be possible to drop and reindex through the bootstrap interface.

Does someone remember why it's gone?

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
> Hmm. If that trace is from 6.5 code, then postgres.c should certainly
> be calling proc_exit after the recv() fails. I wonder if proc_exit is
> returning because proc_exit_inprogress is nonzero? proc_exit's use of
> elog(ERROR) does look mighty bogus to me --- that path could possibly
> cause a recursion just like this, but how did the code get into it to
> begin with?

The proc_exit_inprogress stuff was added by me after I found some backends doing exactly that sort of infinite recursion after a socket recv error. It doesn't correct the original error but at least it will exit the backend after 10 iterations. The elog(ERROR) might be bogus in this context, but how can you otherwise notify the error? Maybe a better solution could be this:

    if (proc_exit_inprogress++ == 9)
        elog(ERROR, "infinite recursion in proc_exit");
    if (proc_exit_inprogress >= 9)
        goto exit;

--
Massimo Dal Zotto

+----------------------------------------------------------------------+
|  Massimo Dal Zotto               email: dz@cs.unitn.it                |
|  Via Marconi, 141                phone: ++39-0461534251               |
|  38057 Pergine Valsugana (TN)    www: http://www.cs.unitn.it/~dz/     |
|  Italy                           pgp: finger dz@tango.cs.unitn.it     |
+----------------------------------------------------------------------+
> > Hmm. If that trace is from 6.5 code, then postgres.c should certainly
> > be calling proc_exit after the recv() fails. I wonder if proc_exit is
> > returning because proc_exit_inprogress is nonzero? proc_exit's use of
> > elog(ERROR) does look mighty bogus to me --- that path could possibly
> > cause a recursion just like this, but how did the code get into it to
> > begin with?
>
> The proc_exit_inprogress stuff was added by me after I found some backends
> doing exactly that sort of infinite recursion after a socket recv error.
> It doesn't correct the original error but at least it will exit the backend
> after 10 iterations. The elog(ERROR) might be bogus in this context, but how
> can you otherwise notify the error? Maybe a better solution could be this:
>
>     if (proc_exit_inprogress++ == 9)
>         elog(ERROR, "infinite recursion in proc_exit");
>     if (proc_exit_inprogress >= 9)
>         goto exit;

Fix applied:

    /*
     * If proc_exit is called too many times something bad is happening, so
     * exit immediately.  This is crafted in two if's for a reason.
     */
    if (proc_exit_inprogress == 9)
        elog(ERROR, "infinite recursion in proc_exit");
    if (proc_exit_inprogress >= 9)
        goto exit;

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
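Why "two if's for a reason": elog(ERROR) never returns to its caller; it hands control back to the main loop, so a single combined test could not both report the recursion and bail out. The first test reports on the 9th entry into proc_exit; the second catches the 10th (and any later) entry and exits at once. The program below is a small self-contained model of that behavior, a sketch under assumptions rather than backend source: elog() is a stub that prints and then longjmps to a stand-in main loop whose only action is to call proc_exit() again, and the failing exit callback is simulated with another elog() call.

    /*
     * Model of the two-if guard; an illustration, not PostgreSQL source.
     * elog(ERROR) is modelled as a longjmp back to a loop that keeps
     * failing and therefore keeps re-entering proc_exit().
     */
    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ERROR 1

    static jmp_buf Warn_restart;        /* stands in for the main loop's setjmp */
    static int     proc_exit_inprogress = 0;

    static void
    elog(int level, const char *msg)
    {
        (void) level;
        fprintf(stderr, "ERROR: %s\n", msg);
        longjmp(Warn_restart, 1);       /* control never returns to the caller */
    }

    static void
    proc_exit(int code)
    {
        proc_exit_inprogress++;

        /*
         * First if: on the 9th entry, report the recursion.  elog() longjmps
         * away, so the second if is NOT reached in this invocation; that is
         * why the guard cannot be a single test.
         */
        if (proc_exit_inprogress == 9)
            elog(ERROR, "infinite recursion in proc_exit");

        /* Second if: on the 10th (and any later) entry, give up immediately. */
        if (proc_exit_inprogress >= 9)
            goto exit;

        /*
         * Normally the on_shmem_exit/on_proc_exit callbacks run here.  In the
         * broken scenario one of them fails with elog(ERROR), which is what
         * sends control back to the main loop.
         */
        elog(ERROR, "exit callback failed");

    exit:
        fprintf(stderr, "exiting after %d entries into proc_exit\n",
                proc_exit_inprogress);
        exit(code);
    }

    int
    main(void)
    {
        /* Stand-in for the backend main loop: every pass ends in proc_exit(). */
        setjmp(Warn_restart);
        proc_exit(1);
    }

The output is eight "exit callback failed" reports, one "infinite recursion in proc_exit" report on the ninth entry, and an immediate exit on the tenth, which is the capped behavior Massimo describes.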
Bruce Momjian <maillist@candle.pha.pa.us> writes:
> Fix applied:
>
>     /*
>      * If proc_exit is called too many times something bad is happening, so
>      * exit immediately.  This is crafted in two if's for a reason.
>      */
>     if (proc_exit_inprogress == 9)
>         elog(ERROR, "infinite recursion in proc_exit");
>     if (proc_exit_inprogress >= 9)
>         goto exit;

That isn't going to make things any better, because it's still laboring under the same basic oversight: elog(ERROR) does not return to the caller, it returns control to the main loop. Thus, having proc_exit call elog(ERROR) is simply *guaranteed* to create a failure. proc_exit must not allow control to return anywhere, under any circumstances, because it is used as the final recourse when things are too screwed up to consider continuing.

Although it's not hard to remove this silliness from proc_exit itself, I suspect that the real source of the problem is probably elog(ERROR) being called by one of the on_shmem_exit or on_proc_exit routines that proc_exit is supposed to call. I don't think we can hunt down and eliminate all possible paths where that could happen (even if we could do it today, it's too likely that future code changes would make it possible again). I think what we must do instead is to fix things so that if an on_shmem_exit/on_proc_exit routine elog's, control comes back to proc_exit and we carry on exiting with the remaining on_shmem_exit/on_proc_exit routines (*not* calling the one that failed again, of course).

A sketch of the necessary changes is:

1. proc_exit_inprogress becomes a global, and we change the setjmp-catching code in postgres.c so that if proc_exit_inprogress is nonzero, it just calls proc_exit() immediately. This allows proc_exit to recover control after an on_shmem_exit/on_proc_exit routine fails via elog(). (Note: for elog(FATAL), it's already true that elog passes control straight off to proc_exit.)

2. proc_exit and shmem_exit should decrement on_proc_exit_index and on_shmem_exit_index as they work through the lists of registered routines --- in effect, each routine is removed from the list just before it is called. That prevents the same routine from being called again if it fails.

3. Ensure that the final exit() code is nonzero if we cycled through proc_exit more than once --- errors during proc_exit should always be treated as the equivalent of an outright crash, forcing a postmaster cleanup cycle, I think.

Before I try to implement this scheme, can anyone spot a problem with it?

regards, tom lane
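A compilable sketch of that scheme follows. It is an illustration under assumptions (a stub elog(), a tiny callback array, and invented callback names bad_callback and good_callback), not the code that eventually went into the backend. The property to notice is point 2: each routine is removed from the list before it is called, so a routine that fails via elog() is never invoked a second time when proc_exit() is re-entered, while the remaining routines still run.

    /*
     * Sketch of the proposed scheme; illustrative only, not backend source.
     * elog(ERROR) is modelled as a longjmp back to the "main loop" in main().
     */
    #include <setjmp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define ERROR 1

    typedef void (*exit_cb) (void);

    static jmp_buf Warn_restart;             /* the main loop's setjmp buffer */
    static int     proc_exit_inprogress = 0; /* point 1: now a global */

    static exit_cb on_proc_exit_list[8];
    static int     on_proc_exit_index = 0;

    static void
    elog(int level, const char *msg)
    {
        (void) level;
        fprintf(stderr, "ERROR: %s\n", msg);
        longjmp(Warn_restart, 1);            /* never returns to the caller */
    }

    static void
    on_proc_exit(exit_cb func)
    {
        on_proc_exit_list[on_proc_exit_index++] = func;
    }

    static void
    proc_exit(int code)
    {
        proc_exit_inprogress++;

        /*
         * Point 2: take each routine off the list *before* calling it.  If it
         * elog()s, control goes back to the main loop, which re-enters
         * proc_exit(), and we simply continue with the remaining routines.
         */
        while (--on_proc_exit_index >= 0)
            (*on_proc_exit_list[on_proc_exit_index]) ();

        /* Point 3: more than one pass means something failed on the way out. */
        exit(proc_exit_inprogress > 1 ? 1 : code);
    }

    /* Two hypothetical exit callbacks: one misbehaves, one is well-behaved. */
    static void
    bad_callback(void)
    {
        elog(ERROR, "exit callback failed");
    }

    static void
    good_callback(void)
    {
        printf("good_callback ran exactly once\n");
    }

    int
    main(void)
    {
        on_proc_exit(good_callback);
        on_proc_exit(bad_callback);          /* registered last, so it runs first */

        if (setjmp(Warn_restart) != 0)
        {
            /* Point 1: if we were already exiting, just keep exiting. */
            if (proc_exit_inprogress)
                proc_exit(1);
            /* ... ordinary error recovery for a query would go here ... */
        }

        proc_exit(0);
    }

Run, the failing routine's elog report appears once, the remaining routine still executes, and the process exits with a nonzero status, which is exactly the behavior points 1 through 3 aim for.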
> That isn't going to make things any better, because it's still laboring
> under the same basic oversight: elog(ERROR) does not return to the
> caller, it returns control to the main loop. Thus, having proc_exit
> call elog(ERROR) is simply *guaranteed* to create a failure. proc_exit
> must not allow control to return anywhere, under any circumstances,
> because it is used as the final recourse when things are too screwed up
> to consider continuing.

Massimo pointed out some problems in my fix. The new code is:

    if (++proc_exit_inprogress == 9)
        elog(ERROR, "infinite recursion in proc_exit");
    if (proc_exit_inprogress >= 9)
        goto exit;

    /* ----------------
     *  if proc_exit_inprocess > 1, then it means that we
     *  are being invoked from within an on_exit() handler
     *  and so we return immediately to avoid recursion.
     * ----------------
     */
    if (proc_exit_inprogress > 1)
        return;

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <maillist@candle.pha.pa.us> writes:
>     /* ----------------
>      *  if proc_exit_inprocess > 1, then it means that we
>      *  are being invoked from within an on_exit() handler
>      *  and so we return immediately to avoid recursion.
>      * ----------------
>      */
>     if (proc_exit_inprogress > 1)
>         return;

No, no, no, noooo!!!

proc_exit MUST NOT RETURN. EVER, UNDER ANY CIRCUMSTANCES.

If it does, that means that elog(STOP) can return under some circumstances. The callers of elog() are not expecting that, and they are likely to screw things up even worse if elog returns control unexpectedly.

AFAICS, this set of problems cannot be fixed by localized patching in proc_exit. We have to globally change the way in which errors are processed after proc_exit has begun execution.

regards, tom lane
> Bruce Momjian <maillist@candle.pha.pa.us> writes:
> >     /* ----------------
> >      *  if proc_exit_inprocess > 1, then it means that we
> >      *  are being invoked from within an on_exit() handler
> >      *  and so we return immediately to avoid recursion.
> >      * ----------------
> >      */
> >     if (proc_exit_inprogress > 1)
> >         return;
>
> No, no, no, noooo!!!
>
> proc_exit MUST NOT RETURN. EVER, UNDER ANY CIRCUMSTANCES.
>
> If it does, that means that elog(STOP) can return under some
> circumstances. The callers of elog() are not expecting that,
> and they are likely to screw things up even worse if elog returns
> control unexpectedly.
>
> AFAICS, this set of problems cannot be fixed by localized patching in
> proc_exit. We have to globally change the way in which errors are
> processed after proc_exit has begun execution.

Oh, well. I tried. Code is better than it used to be at least.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026