Thread: Build farm failure

Build farm failure

From
Gregory Stark
Date:
dugong (icc on ia64) has been failing the contrib installcheck consistently
since 6 days ago with errors like:

ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory

I checked a cvs diff between the two timestamps, and that's precisely when the
self-adjusting bgwriter changes went in. Those changes knocked around bgwriter's
checkpoint logic quite a bit, so it seems likely this is a real bug.

On the other hand it seems weird that it only occurs in contrib's check and
not the normal tests. And it also seems weird it only happens on icc ia64 and
not any other architecture under the sun. I installed icc on my machine (ia32)
but didn't get the same problem (we don't seem to have a recent icc ia32 build
farm member).

The only things I know about icc are that it likes to pad structs
unnecessarily and is picky about its -f arguments...


LOG:  database system was shut down at 2007-10-02 02:21:38 MSD
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
NOTICE:  database "contrib_regression" does not exist, skipping
NOTICE:  type "gbtreekey4" is not yet defined
DETAIL:  Creating a shell type definition.
NOTICE:  argument type gbtreekey4 is only a shell
NOTICE:  type "gbtreekey8" is not yet defined
DETAIL:  Creating a shell type definition.
NOTICE:  argument type gbtreekey8 is only a shell
NOTICE:  type "gbtreekey16" is not yet defined
DETAIL:  Creating a shell type definition.
NOTICE:  argument type gbtreekey16 is only a shell
NOTICE:  type "gbtreekey32" is not yet defined
DETAIL:  Creating a shell type definition.
NOTICE:  argument type gbtreekey32 is only a shell
NOTICE:  type "gbtreekey_var" is not yet defined
DETAIL:  Creating a shell type definition.
NOTICE:  argument type gbtreekey_var is only a shell
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0
NOTICE:  database "contrib_regression" does not exist, skipping
ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
ERROR:  checkpoint request failed
HINT:  Consult recent messages in the server log for details.
STATEMENT:  CREATE DATABASE "contrib_regression" TEMPLATE=template0


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Build farm failure

From
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> dugong (icc on ia64) has been failing the contrib installcheck consistently
> since 6 days ago with errors like:
>
> ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory
>
> I checked a cvs diff between the two timestamps, and that's precisely when the
> self-adjusting bgwriter changes went in. Those changes knocked around bgwriter's
> checkpoint logic quite a bit, so it seems likely this is a real bug.

Of course, moments after I sent that, it finally occurred to me to look further
back in dugong's history, and its previous failing periods showed exactly the
same errors. So the bgwriter changes are off the hook.

And given the consistency and the fact that the other icc machines didn't show
the same problems it sounds like it's something about that machine, not a
software problem.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Build farm failure

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> dugong (icc on ia64) has been failing the contrib installcheck consistently
> since 6 days ago with errors like:
> ERROR:  could not fsync segment 0 of relation 1663/40960/41403: No such file or directory

Yeah, I already asked Sergey about this but I guess he's not had time to
poke at it yet:
http://archives.postgresql.org/pgsql-hackers/2007-09/msg01061.php

My theory is that putting an Assert right there is somehow breaking
ForwardFsyncRequest --- maybe it becomes a complete no-op, maybe it
forwards a corrupt request, who knows.  The only way that there'd be
any visible problem from that, if you weren't actually performing
pull-the-power-plug tests, would be that lack of forwarding of "revoke"
requests could lead to the bgwriter attempting to fsync files in
already-dropped databases or tablespaces.  Which matches the visible
symptoms exactly.

This looks like nothing so much as a compiler bug, particularly given
that we're seeing it with only one compiler on only one platform.
We should study it more carefully, both to look for workarounds and
to file a suitable bug report, but I'll be pretty surprised if it's
really our bug.
        regards, tom lane


Re: Build farm failure

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> And given the consistency and the fact that the other icc machines
> didn't show the same problems it sounds like it's something about that
> machine, not a software problem.

Well, we haven't *got* any other icc-on-ia64 machines AFAICS, so it
could easily be a software problem.

Your remark about padding set off an alarm bell on re-reading --- what
if RelFileNode is padded to 16 bytes on that architecture?  Junk in the
padding might break lookups in bgwriter's internal hashtable.  But the
Intel docs I can find do not suggest any such thing.  Could someone
confirm what sizeof(RelFileNode) is on ia64?
        regards, tom lane


Re: Build farm failure

From
Jeremy Drake
Date:
On Tue, 2 Oct 2007, Gregory Stark wrote:

> (we don't seem to have a recent icc ia32 build farm member).

Sorry about that, my buildfarm member (mongoose) is down with hardware
problems, and probably will be for the foreseeable future.  For some
reason, it suddenly decided to stop recognizing its RAID card...

-- 
In the beginning was the word.
But by the time the second word was added to it,
there was trouble.
For with it came syntax ...    -- John Simon