Thread: Unhelpful debug tools on OS X :-(

Unhelpful debug tools on OS X :-(

From
Tom Lane
Date:
Dunno if anyone's noticed, but buildfarm member jackal has been crashing
intermittently in the "cluster" regression test.  I found out that I can
reproduce that here on a Mac Mini (apparently the same or nearly same
hardware/software as jackal) --- just repeat the parallel tests often
enough, and it'll fail once in awhile.  I've got a core file but gdb
is being no help at all:

mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster /cores/core.23609
GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .. done

Core was generated by `/Users/tgl/testversion/bin/postmaster'.
Core file contained no thread-specific data

(gdb) bt
#0  0x00000000 in ?? ()
(gdb)

Any suggestions how to extract some info out of this?
        regards, tom lane


Re: Unhelpful debug tools on OS X :-(

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Dunno if anyone's noticed, but buildfarm member jackal has been crashing
> intermittently in the "cluster" regression test.  I found out that I can
> reproduce that here on a Mac Mini (apparently the same or nearly same
> hardware/software as jackal) --- just repeat the parallel tests often
> enough, and it'll fail once in awhile.  I've got a core file but gdb
> is being no help at all:
> 
> mini:~/pgsql/src/test/regress tgl$ gdb /Users/tgl/testversion/bin/postmaster /cores/core.23609
> GNU gdb 6.3.50-20050815 (Apple version gdb-563) (Wed Jul 19 05:10:58 GMT 2006)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-apple-darwin"...Reading symbols for shared libraries .. done
> 
> Core was generated by `/Users/tgl/testversion/bin/postmaster'.
> Core file contained no thread-specific data
> 
> (gdb) bt
> #0  0x00000000 in ?? ()
> (gdb)
> 
> Any suggestions how to extract some info out of this?

no idea on that one - but clownfish managed to fail the cluster test
once too lately:

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=clownfish&dt=2007-04-09%2023:03:03


Stefan


Re: Unhelpful debug tools on OS X :-(

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Any suggestions how to extract some info out of this?

Does OS X have the catchsegv tool? If you can run postmaster with that, 
you should get a backtrace when it crashes. Unless it has the same 
problem as gdb, of course..

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Unhelpful debug tools on OS X :-(

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Any suggestions how to extract some info out of this?

> Does OS X have the catchsegv tool?

No, but I suddenly remembered about CrashReporter, and sure enough it's
catching these crashes:

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x00000010

Thread 0 Crashed:
0   postmaster  0x001af4ef smgrextend + 12 (smgr.c:485)
1   postmaster  0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278)
2   postmaster  0x000bdc22 cluster_rel + 850 (cluster.c:806)
3   postmaster  0x000be119 cluster + 160 (cluster.c:220)
4   postmaster  0x001b74a8 PortalRunUtility + 233 (palloc.h:84)
5   postmaster  0x001b7784 PortalRunMulti + 237 (pquery.c:1271)
6   postmaster  0x001b80ae PortalRun + 918 (pquery.c:813)
7   postmaster  0x001b2afd exec_simple_query + 656 (postgres.c:965)
8   postmaster  0x001b4b0c PostgresMain + 5628 (postgres.c:3507)
9   postmaster  0x00183973 ServerLoop + 2828 (postmaster.c:2614)
10  postmaster  0x00184b1e PostmasterMain + 2794 (postmaster.c:972)
11  postmaster  0x00130f8e main + 1236 (main.c:188)
12  postmaster  0x00001e86 _start + 216
13  postmaster  0x00001dad start + 41

So it looks like this has got something to do with the MVCC-safe cluster
changes, which is not too surprising considering it started happening
around about then.  Off to have a look ...
        regards, tom lane


Re: Unhelpful debug tools on OS X :-(

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Any suggestions how to extract some info out of this?
> 
>> Does OS X have the catchsegv tool?
> 
> No, but I suddenly remembered about CrashReporter, and sure enough it's
> catching these crashes:
> 
> Exception:  EXC_BAD_ACCESS (0x0001)
> Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x00000010
> 
> Thread 0 Crashed:
> 0   postmaster  0x001af4ef smgrextend + 12 (smgr.c:485)
> 1   postmaster  0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278)
> 2   postmaster  0x000bdc22 cluster_rel + 850 (cluster.c:806)
> 3   postmaster  0x000be119 cluster + 160 (cluster.c:220)
> 4   postmaster  0x001b74a8 PortalRunUtility + 233 (palloc.h:84)
> 5   postmaster  0x001b7784 PortalRunMulti + 237 (pquery.c:1271)
> 6   postmaster  0x001b80ae PortalRun + 918 (pquery.c:813)
> 7   postmaster  0x001b2afd exec_simple_query + 656 (postgres.c:965)
> 8   postmaster  0x001b4b0c PostgresMain + 5628 (postgres.c:3507)
> 9   postmaster  0x00183973 ServerLoop + 2828 (postmaster.c:2614)
> 10  postmaster  0x00184b1e PostmasterMain + 2794 (postmaster.c:972)
> 11  postmaster  0x00130f8e main + 1236 (main.c:188)
> 12  postmaster  0x00001e86 _start + 216
> 13  postmaster  0x00001dad start + 41
> 
> So it looks like this has got something to do with the MVCC-safe cluster
> changes, which is not too surprising considering it started happening
> around about then.  Off to have a look ...

I've been looking at the code for a few minutes as well, but haven't 
found an explanation for that yet.

But I did notice that we're not fsyncing the newly written relation like 
we should. There's a comment raw_heap_insert:/* * Now write the page. We say isTemp = true even if it's not a * temp
table,because there's no need for smgr to schedule an * fsync for this write; we'll do it ourselves before committing.
*/smgrextend(state->rs_new_rel->rd_smgr,state->rs_blockno,           (char *) page, true);
 

That's copy-pasted from tablecmds.c. But unlike in tablecmds.c, 
end_heap_rewrite only fsyncs the new file if we're not WAL-logging. 
Proposed fix:

Index: src/backend/access/heap/rewriteheap.c
===================================================================
RCS file: 
/home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/heap/rewriteheap.c,v
retrieving revision 1.1
diff -c -r1.1 rewriteheap.c
*** src/backend/access/heap/rewriteheap.c       8 Apr 2007 01:26:27 
-0000      1.1
--- src/backend/access/heap/rewriteheap.c       17 Apr 2007 20:50:05 -0000
***************
*** 272,282 ****        }
        /*
!        * If not WAL-logging, must fsync before commit.  We use heap_sync
!        * to ensure that the toast table gets fsync'd too.         */
!       if (!state->rs_use_wal)
!               heap_sync(state->rs_new_rel);
        /* Deleting the context frees everything */        MemoryContextDelete(state->rs_cxt);
--- 272,284 ----        }
        /*
!        * Must fsync before commit, even if we've WAL-logged the changes,
!        * because we've written pages outside the buffer manager.  See 
comments!        * in copy_relation_data in commands/tablecmds.c for 
more information.
!        *
!        * We use heap_sync to ensure that the toast table gets fsync'd too.         */
!       heap_sync(state->rs_new_rel);
        /* Deleting the context frees everything */        MemoryContextDelete(state->rs_cxt);


BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not 
heap_sync. How about the toast table, it goes through shared buffers as 
usual, right?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Unhelpful debug tools on OS X :-(

From
Tom Lane
Date:
I wrote:
> Thread 0 Crashed:
> 0   postmaster  0x001af4ef smgrextend + 12 (smgr.c:485)
> 1   postmaster  0x00029044 end_heap_rewrite + 208 (rewriteheap.c:278)
> 2   postmaster  0x000bdc22 cluster_rel + 850 (cluster.c:806)
> 3   postmaster  0x000be119 cluster + 160 (cluster.c:220)

AFAICS, a crash inside smgrextend must indicate it was passed a junk
SMgrRelation, which implies that the problem has to trace back to
the new heap's rd_smgr getting closed since the rewriteheap operation
started.  I think the only possible path for that is an sinval queue
overflow and consequent SI reset during the test.  That would explain
why it's such an intermittent failure, but it is curious that Darwin
seems much more prone to this than other platforms.  I wonder if Apple
has tweaked the scheduler in a way that allows backends to not be given
any cycles for long periods ...

Anyway, patch committed; we'll see if things don't get more stable.

Memo to self: run some tests with CLOBBER_CACHE_ALWAYS after the
remaining major patches have landed.
        regards, tom lane


Re: Unhelpful debug tools on OS X :-(

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> But I did notice that we're not fsyncing the newly written relation like 
> we should.

Good point, but doesn't the analogy to copy_relation_data say that we
should sync if not rd_istemp?  (This is my fault BTW; your original
patch kept the data in shared buffers, so it wasn't subject to the
problem.)

> BTW: In tablecmds.c the new relation is fsynced with smgrimmedsync, not 
> heap_sync.

That's okay since that routine is just copying the one table.  TOAST is
handled via recursion of ATExecSetTableSpace.
        regards, tom lane


Re: Unhelpful debug tools on OS X :-(

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> But I did notice that we're not fsyncing the newly written relation like 
>> we should.
> 
> Good point, but doesn't the analogy to copy_relation_data say that we
> should sync if not rd_istemp? 

Right. I don't know why I didn't copy that if-statement. It seems to be 
bedtime for me...

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com