Curious buildfarm failures

From
Tom Lane
Date:
Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
a few buildfarm failures along the lines of

  -- Commit table drop
  COMMIT PREPARED 'regress-two';
! PANIC:  failed to re-find shared proclock object
! PANIC:  failed to re-find shared proclock object
! connection to server was lost

Evidently I bollixed something, but what?  I've been unable to reproduce
this locally so far.  Anybody see what's wrong?

Another thing is that dugong has been reproducibly failing with
 drop cascades to table testschema.atable
  -- Should succeed
  DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty

since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
some local problem there but I'm suspicious that there's a connection.
But what?

Any insights out there?
        regards, tom lane



Re: Curious buildfarm failures

From
Andres Freund
Date:
On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> a few buildfarm failures along the lines of
>   
>   -- Commit table drop
>   COMMIT PREPARED 'regress-two';
> ! PANIC:  failed to re-find shared proclock object
> ! PANIC:  failed to re-find shared proclock object
> ! connection to server was lost
> 
> Evidently I bollixed something, but what?  I've been unable to reproduce
> this locally so far.  Anybody see what's wrong?
> 
> Another thing is that dugong has been reproducibly failing with
> 
>  drop cascades to table testschema.atable
>   -- Should succeed
>   DROP TABLESPACE testspace;
> + ERROR:  tablespace "testspace" is not empty
> 
> since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> some local problem there but I'm suspicious that there's a connection.
> But what?
> 
> Any insights out there?

It also has:

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
FATAL:  could not open file "base/16384/28182": No such file or directory
CONTEXT:  writing block 6 of relation base/16384/28182
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)
LOG:  checkpointer process (PID 30366) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
LOG:  abnormal database system shutdown


================== stack trace: pgsql.9958/src/test/regress/tmp_check/data/core ==================
Using host libthread_db library "/lib/tls/libthread_db.so.1".

warning: Can't read pathname for load map: Input/output error.
Core was generated by `postgres: checkpointer process                                                '.
Program terminated with signal 6, Aborted.

#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x2000000000953bb0 in raise () from /lib/tls/libc.so.6.1
#2  0x2000000000956df0 in abort () from /lib/tls/libc.so.6.1
#3  0x4000000000b4b510 in ExceptionalCondition (conditionName=0x4000000000d76390 "!(PrivateRefCount[i] == 0)", errorType=0x4000000000d06500 "FailedAssertion", fileName=0x4000000000d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
#4  0x40000000007a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
#5  0x40000000007c4e50 in shmem_exit (code=1) at ipc.c:221
#6  0x40000000007c4fa0 in proc_exit_prepare (code=1) at ipc.c:181
#7  0x40000000007c4ab0 in proc_exit (code=1) at ipc.c:96
#8  0x4000000000b5d390 in errfinish (dummy=0) at elog.c:518
#9  0x4000000000823380 in _mdfd_getseg (reln=0x6000000000155420, forknum=1397792, blkno=6, skipFsync=0 '\0', behavior=EXTENSION_FAIL) at md.c:577
#10 0x400000000081e5c0 in mdwrite (reln=0x6000000000155420, forknum=MAIN_FORKNUM, blocknum=6, buffer=0x2000000001432ea0 "", skipFsync=0 '\0') at md.c:735
#11 0x4000000000824690 in smgrwrite (reln=0x6000000000155420, forknum=MAIN_FORKNUM, blocknum=6, buffer=0x2000000001432ea0 "", skipFsync=0 '\0') at smgr.c:534
#12 0x400000000079e510 in FlushBuffer (buf=0x1, reln=0x6000000000155420)   at bufmgr.c:1941
#13 0x40000000007a10b0 in SyncOneBuffer (buf_id=0, skip_recently_used=0 '\0')   at bufmgr.c:1677
#14 0x40000000007a0c00 in CheckPointBuffers (flags=5) at bufmgr.c:1284
#15 0x40000000001fcbf0 in CheckPointGuts (checkPointRedo=80827000, flags=5)   at xlog.c:7391
#16 0x40000000001fb2a0 in CreateCheckPoint (flags=5) at xlog.c:7240
#17 0x40000000001f6820 in ShutdownXLOG (code=14699520,    arg=4611686018440093920) at xlog.c:6823
#18 0x400000000072d780 in _setjmp_lpad_CheckpointerMain_0$0$18 ()   at checkpointer.c:413
#19 0x4000000000235810 in AuxiliaryProcessMain (argc=496536,    argv=0x60000fffff80e520) at bootstrap.c:433
#20 0x40000000007172b0 in StartChildProcess (type=508288) at postmaster.c:4956
#21 0x4000000000713f50 in reaper (postgres_signal_arg=30365)   at postmaster.c:2568
#22 <signal handler called>
#23 0xa000000000010620 in __kernel_syscall_via_break ()
#24 0x2000000000953f70 in sigprocmask () from /lib/tls/libc.so.6.1
#25 0x4000000000720480 in ServerLoop () at postmaster.c:1521
#26 0x400000000071d9d0 in PostmasterMain (argc=6, argv=0x60000000000d85e0)   at postmaster.c:1244
#27 0x4000000000577a30 in main (argc=6, argv=0x60000000000d8010) at main.c:197

in the log. So it seems like it also could be related to locking
changes although I don't immediately see why.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Curious buildfarm failures

From
Andres Freund
Date:
On 2013-01-14 22:50:16 +0100, Andres Freund wrote:
> On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> > Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> > a few buildfarm failures along the lines of
> >
> >   -- Commit table drop
> >   COMMIT PREPARED 'regress-two';
> > ! PANIC:  failed to re-find shared proclock object
> > ! PANIC:  failed to re-find shared proclock object
> > ! connection to server was lost
> >
> > Evidently I bollixed something, but what?  I've been unable to reproduce
> > this locally so far.  Anybody see what's wrong?
> >
> > Another thing is that dugong has been reproducibly failing with
> >
> >  drop cascades to table testschema.atable
> >   -- Should succeed
> >   DROP TABLESPACE testspace;
> > + ERROR:  tablespace "testspace" is not empty
> >
> > since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> > some local problem there but I'm suspicious that there's a connection.
> > But what?
> >
> > Any insights out there?
>
> It also has:
>
> FATAL:  could not open file "base/16384/28182": No such file or directory
> CONTEXT:  writing block 6 of relation base/16384/28182
> TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)

> #3  0x4000000000b4b510 in ExceptionalCondition (
>     conditionName=0x4000000000d76390 "!(PrivateRefCount[i] == 0)",
>     errorType=0x4000000000d06500 "FailedAssertion",
>     fileName=0x4000000000d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
> #4  0x40000000007a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
> #5  0x40000000007c4e50 in shmem_exit (code=1) at ipc.c:221
>
> in the log. So it seems like it also could be related to locking
> changes although I don't immediately see why.

No such "luck" though, the animal only applied the elog commits:
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dugong&dt=2013-01-14%2000%3A00%3A02&stg=SCM-checkout

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Curious buildfarm failures

From
Heikki Linnakangas
Date:
On 14.01.2013 23:35, Tom Lane wrote:
> Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> a few buildfarm failures along the lines of
>
>    -- Commit table drop
>    COMMIT PREPARED 'regress-two';
> ! PANIC:  failed to re-find shared proclock object
> ! PANIC:  failed to re-find shared proclock object
> ! connection to server was lost
>
> Evidently I bollixed something, but what?  I've been unable to reproduce
> this locally so far.  Anybody see what's wrong?

I was able to reproduce this by setting max_locks_per_transaction and 
max_connections to the minimum. My assumption is that there's something 
wrong in the way hash_update_hash_key() handles collisions.
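
For anyone trying the same reproduction, the postgresql.conf fragment would look roughly like this (the minimums here are taken from the docs and should be double-checked against the server version at hand; max_prepared_transactions also has to be nonzero or COMMIT PREPARED can't run at all):

```
# postgresql.conf sketch: shrink the shared lock table to its minimum
max_connections = 1               # documented minimum
max_locks_per_transaction = 10    # documented minimum
max_prepared_transactions = 2     # must be > 0 for the COMMIT PREPARED test
```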

- Heikki



Re: Curious buildfarm failures

From
Heikki Linnakangas
Date:
On 15.01.2013 00:14, Heikki Linnakangas wrote:
> On 14.01.2013 23:35, Tom Lane wrote:
>> Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
>> a few buildfarm failures along the lines of
>>
>> -- Commit table drop
>> COMMIT PREPARED 'regress-two';
>> ! PANIC: failed to re-find shared proclock object
>> ! PANIC: failed to re-find shared proclock object
>> ! connection to server was lost
>>
>> Evidently I bollixed something, but what? I've been unable to reproduce
>> this locally so far. Anybody see what's wrong?
>
> I was able to reproduce this by setting max_locks_per_transaction and
> max_connections to the minimum. My assumption is that there's something
> wrong in the way hash_update_hash_key() handles collisions.

The problem seems to be when the old and the new key hash to the same
bucket. In that case, hash_update_hash_key() tries to link the entry to
itself. The attached patch fixes it for me.

- Heikki

Attachment

Re: Curious buildfarm failures

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> The problem seems to be when the old and the new key hash to the same
> bucket. In that case, hash_update_hash_key() tries to link the entry to
> itself. The attached patch fixes it for me.

Doh!  I had a feeling that that needed a special case, but didn't think
hard enough.  Thanks.

I think the patch could do with more than no comment, but will fix
that and apply.
        regards, tom lane



Re: Curious buildfarm failures

From
Andres Freund
Date:
On 2013-01-14 22:56:47 +0100, Andres Freund wrote:
> On 2013-01-14 22:50:16 +0100, Andres Freund wrote:
> > On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> > > Since commit 2065dd2834e832eb820f1fbcd16746d6af1f6037, there have been
> > > a few buildfarm failures along the lines of
> > >
> > >   -- Commit table drop
> > >   COMMIT PREPARED 'regress-two';
> > > ! PANIC:  failed to re-find shared proclock object
> > > ! PANIC:  failed to re-find shared proclock object
> > > ! connection to server was lost
> > >
> > > Evidently I bollixed something, but what?  I've been unable to reproduce
> > > this locally so far.  Anybody see what's wrong?
> > >
> > > Another thing is that dugong has been reproducibly failing with
> > >
> > >  drop cascades to table testschema.atable
> > >   -- Should succeed
> > >   DROP TABLESPACE testspace;
> > > + ERROR:  tablespace "testspace" is not empty
> > >
> > > since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> > > some local problem there but I'm suspicious that there's a connection.
> > > But what?
> > >
> > > Any insights out there?
> >
> > It also has:
> >
> > FATAL:  could not open file "base/16384/28182": No such file or directory
> > CONTEXT:  writing block 6 of relation base/16384/28182
> > TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1743)
> 
> > #3  0x4000000000b4b510 in ExceptionalCondition (
> >     conditionName=0x4000000000d76390 "!(PrivateRefCount[i] == 0)",
> >     errorType=0x4000000000d06500 "FailedAssertion",
> >     fileName=0x4000000000d76260 "bufmgr.c", lineNumber=1743) at assert.c:54
> > #4  0x40000000007a7d20 in AtProcExit_Buffers (code=1, arg=59) at bufmgr.c:1743
> > #5  0x40000000007c4e50 in shmem_exit (code=1) at ipc.c:221
> >
> > in the log. So it seems like it also could be related to locking
> > changes although I don't immediately see why.
> 
> No such "luck" though, the animal only applied the elog commits:
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=dugong&dt=2013-01-14%2000%3A00%3A02&stg=SCM-checkout

Do you have any idea what's going on? I don't really have any clue other
than guessing it might be a compiler bug.

Could the buildfarm owner perhaps tell us which files are left in the
tablespace 'testspace'?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Curious buildfarm failures

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
>>>> Another thing is that dugong has been reproducibly failing with
>>>> 
>>>> drop cascades to table testschema.atable
>>>> -- Should succeed
>>>> DROP TABLESPACE testspace;
>>>> + ERROR:  tablespace "testspace" is not empty
>>>> 
>>>> since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
>>>> some local problem there but I'm suspicious that there's a connection.
>>>> But what?

> Do you have any idea what's going on? I don't really have any clue other
> than guessing it might be a compiler bug.

I'm suspicious of that too, but it's hard to see why it would manifest
like this while everything else appears to be fine.  A codegen bug
triggered by elog ought to show up in a lot of places, one would think.

> Could the buildfarm owner perhaps tell us which files are left in the
> tablespace 'testspace'?

I'm betting the answer is "none", and that what's happening is some sort
of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
checkpointer process to clean out all the just-dropped files).  The
PANIC at shutdown looks like it might be some sort of doppelganger of
the same bug, ie dropped table cleaned out too early.
        regards, tom lane



Re: Curious buildfarm failures

From
Andrew Dunstan
Date:
On 01/15/2013 11:04 AM, Andres Freund wrote:
> Could the buildfarm owner perhaps tell us which files are left in the 
> tablespace 'testspace'?


They will not be able to easily - the workspace is normally cleared out 
at the end of each run.

cheers

andrew



Re: Curious buildfarm failures

From
Andres Freund
Date:
On 2013-01-15 11:19:28 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> >>> On 2013-01-14 16:35:48 -0500, Tom Lane wrote:
> >>>> Another thing is that dugong has been reproducibly failing with
> >>>> 
> >>>> drop cascades to table testschema.atable
> >>>> -- Should succeed
> >>>> DROP TABLESPACE testspace;
> >>>> + ERROR:  tablespace "testspace" is not empty
> >>>> 
> >>>> since the elog-doesn't-return patch (b853eb97) went in.  Maybe this is
> >>>> some local problem there but I'm suspicious that there's a connection.
> >>>> But what?
> 
> > Do you have any idea what's going on? I don't really have any clue other
> > than guessing it might be a compiler bug.
> 
> I'm suspicious of that too, but it's hard to see why it would manifest
> like this while everything else appears to be fine.  A codegen bug
> triggered by elog ought to show up in a lot of places, one would think.

The make output showed that for some files optimization was disabled by
the compiler because they were too complex. It's possible that it is
related to that :(.

Interestingly, the compiler couldn't deduce that
e.g. DateTimeParseError() didn't return with the old ereport definition,
but it can with the new one, which seems strange.

> > Could the buildfarm owner perhaps tell us which files are left in the
> > tablespace 'testspace'?
> 
> I'm betting the answer is "none", and that what's happening is some sort
> of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
> checkpointer process to clean out all the just-dropped files).  The
> PANIC at shutdown looks like it might be some sort of doppelganger of
> the same bug, ie dropped table cleaned out too early.

It seems strange, though, that it started failing exactly with that commit
and that all runs have failed exactly the same way since.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Curious buildfarm failures

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> Interestingly the compiler couldn't deduce that
> e.g. DateTimeParseError() didn't return with the old ereport definition
> but it can with the new one which seems strange.

Oooh, I hadn't noticed that.  I guess that must indicate that this
version of icc can deduce unreachability from
    if (true)
        abort();

but not from

    (true) ? abort() : (void) 0;

which is a bit odd but not incredible.  (I had actually wondered while
working on the patch if this wording might be better for some compilers;
seems that's the case.)

What that means is that this compiler was not previously aware that
either ereport(ERROR) or elog(ERROR) doesn't return, but it now knows
that for both.  So that greatly expands the scope of places where
behavior might have changed.  Doesn't simplify our task :-(

>> I'm betting the answer is "none", and that what's happening is some sort
>> of timing problem (ie, the DROP TABLESPACE somehow isn't waiting for the
>> checkpointer process to clean out all the just-dropped files).

> It seems strange tho that it started failing exactly with that commit
> and all runs failed exactly the same way since.

Well, it could be quite reproducible, if for example what's happening is
that the DROP is failing to wait for the checkpointer at all.

Is there a way to enable log_checkpoints during a buildfarm run?
It'd be good to get timestamps added to the postmaster log entries, too.
        regards, tom lane



Re: Curious buildfarm failures

From
Andrew Dunstan
Date:
On 01/15/2013 11:57 AM, Tom Lane wrote:
> Well, it could be quite reproducible, if for example what's happening is
> that the DROP is failing to wait for the checkpointer at all.
>
> Is there a way to enable log_checkpoints during a buildfarm run?
> It'd be good to get timestamps added to the postmaster log entries, too.
>
>             


Yes, it's very easy. In the config file, do something like:
    extra_config =>
        {
            DEFAULT => [
                    q(log_line_prefix = '%t [%c:%l] '),
                    "log_checkpoints = 'true',
                ],
        },



cheers

andrew



Re: Curious buildfarm failures

From
Andrew Dunstan
Date:
On 01/15/2013 12:07 PM, Andrew Dunstan wrote:
>
> On 01/15/2013 11:57 AM, Tom Lane wrote:
>> Well, it could be quite reproducible, if for example what's happening is
>> that the DROP is failing to wait for the checkpointer at all.
>>
>> Is there a way to enable log_checkpoints during a buildfarm run?
>> It'd be good to get timestamps added to the postmaster log entries, too.
>>
>>
>
>
> Yes, it's very easy. In the config file, do something like:

I had a missing quote. should be:
    extra_config =>
        {
            DEFAULT => [
                    q(log_line_prefix = '%t [%c:%l] '),
                    "log_checkpoints = 'true'",
                ],
        },

cheers


andrew