Thread: Re: Bug#41277: postgresql 6.5.1-3 + sparc (sun4u) == nasty nasty crashes

Re: Bug#41277: postgresql 6.5.1-3 + sparc (sun4u) == nasty nasty crashes

From: "Oliver Elphick"
[postgresql lists added to Cc in hope of elucidation]

Adam Di Carlo wrote:
  >
  >[Background: PostgreSQL is causing extremely hard crashes on my Sun4u
  >(Ultra5) Debian SPARC system.  Anyone should be able to reproduce this
  >by installing the postgresql-test environment, and running:
  >
  >  # cd /usr/lib/postgresql/test/regress
  >  # chown -R postgres .
  >  # su - postgres
  >  $ cd /usr/lib/postgresql/test/regress
  >  $ make runtest
  >
  >BEWARE -- this hard crashes my system.  You may crash hard; you may
  >lose data.
  >
  >Note: I am running a mostly up-to-date 2.2.9 kernel (stock image from
  >potato) with the newest postgresql package (6.5.1-3 I believe).
  >]
  >
  >>That is very nasty  --  and unexpected; I would like to report whatever
  >>information is available to pgsql-ports@postgresql.org.  However, they
  >>will need to know exactly what was going on - logfile output, if available,
  >>progress through the test, test output file, if it survived.  It doesn't
  >>seem at all like the problem that I thought I was asking you to look at.
  >>We should investigate whether there is some entirely separate cause.
  >
  >Yes.  On followup, I am getting intermittent hard crashes when running
  >regress.sh or doing any operation with postgresql.  Obviously, this is
  >more on the level of a sparc64 kernel problem, even, than a purely
  >postgres problem -- after all, no user process should be able to take
  >out the system this way.

I regret that I have no experience with kernel debugging.

  >
  >My most recent crash has this output to 'make runtest':
  >
  >path .. ok
  >polygon .. ok
  >circle .. ok
  >geometry .. failed
  >timespan ..
  >
  >And in the postgres.log, with debugging at 4:
  >
  >plan:
  >
  >{ SEQSCAN :cost 43 :size 334 :width 16 :state <> :qptargetlist
  >({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 705 :restypmod -1
  >:resname "one" :reskey 0 :reskeyop 0 :resgroupref 0 :resjunk false }
  >:expr { CONST :consttype 705 :constlen -1 :constisnull false
  >:constvalue  4 [  0  0  0  4 ]  :constbyval false }} { TARGETENTRY
  >:resdom { RESDOM :resno 2 :restype 600 :restypmod -1 :resname "f1"
  >:reskey 0 :reskeyop 0 :resgroupref 0 :resjunk false } :expr { VAR
  >:varno 1 :varattno 1 :vartype 600 :vartypmod -1  :varlevelsup 0
  >:varnoold 1 :varoattno 1}}) :qpqual ({ EXPR :typeOid 16  :opType func
  >:oper { FUNC :funcid 1532 :functype 16 :funcisindex false :funcsize 0
  >:func_fcache @ 0x0 :func_tlist ({ TARGETENTRY :resdom { RESDOM :resno
  >1 :restype 16 :restypmod -1 :resname "<noname>" :reskey 0 :reskeyop 0
  >:resgroupref 0 :resjunk false } :expr { VAR :varno -1 :varattno 1
  >:vartype 16 :vartypmod -1  :varlevelsup 0 :varnoold -1 :varoattno 1}})
  >:func_planlist <>} :args ({ VAR :varno 1 :varattno 1 :vartype 600
  >:vartypmod -1  :varlevelsup 0 :varnoold 1 :varoattno 1} { CONST
  >:consttype 600 :constlen 16 :constisnull false :constvalue  16 [  64
  >20  102  102  102  102  102  102  64  65  64  0  0  0  0  0 ]
  >:constbyval false })}) :lefttree <> :righttree <> :extprm () :locprm
  >() :initplan <> :nprm 0  :scanrelid 1 }
  >
  >ProcessQuery
  >CommitTransactionCommand
  >StartTransactionCommand
  >query: SELECT '' AS one, p1.f1
  >   FROM POINT_TBL p1
  >   WHERE p1.f1 ?| '(5.1,34.5)'::point;
  >parser outputs:
  >
  >{ QUERY :command 1  :utility <> :resultRelation 0 :into <> :isPortal
  >false :isBinary false :isTemp false :unionall false  :unique <>
  >:sortClause <> :rtable ({ RTE :relname point_tbl :refname p1 :relid
  >20864 :inh false :inFromCl true :skipAcl false}) :targetlist
  >({ TARGETENTRY :resdom { RESDOM :resno 1 :restype 705 :restypmod -1
  >:resname "one" :reskey 0 :reskeyop 0 :resgroupref 0 :resjunk false }
  >:expr { CONST :consttype 705 :constlen -1 :constisnull false
  >:constvalue  4 [  0  0  0  4 ]  :constbyval false }} { TARGETENTRY
  >:resdom { RESDOM :resno 2 :restype 600 :restypmod -1
  >:resname "f1" :reskey 0 :reskeyop 0 :resgroupref 0 :resjunk false }
  >:expr { VAR :varno 1 :varattno 1 :vartype 600 :vartypmod -1
  >:varlevelsup 0 :varnoold 1 :varoattno 1}}) :qual { EXPR :typeOid 16
  >:opType op :oper { OPER :opno 809
  >:opid 0 :opresulttype 16 } :args ({ VAR :varno 1 :varattno 1 :vartype
  >
  >------
  >
  >Output just stops there, with a hard crash to the system.

Not even a kernel oops output?

  >
  >--
  >.....Adam Di Carlo....adam@onShore.com.....<URL:http://www.onShore.com/>

Can postgresql developers tell from this what routine we are in when the
crash occurs?  I suppose that log output is buffered; where can we turn
off buffering so that all possible output is saved to disk before the
crash?

--
      Vote against SPAM: http://www.politik-digital.de/spam/
                 ========================================
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight                              http://www.lfix.co.uk/oliver
               PGP key from public servers; key ID 32B8FAA1
                 ========================================
     "And why call ye me, Lord, Lord, and do not the things
      which I say?"                   Luke 6:46



"Oliver Elphick" <olly@lfix.co.uk> writes:
>> Yes.  On followup, I am getting intermittent hard crashes when running
>> regress.sh or doing any operation with postgresql.  Obviously, this is
>> more on the level of a sparc64 kernel problem, even, than a purely
>> postgres problem -- after all, no user process should be able to take
>> out the system this way.

Yipes...

> Can postgresql developers tell from this what routine we are in when the
> crash occurs?  I suppose that log output is buffered; where can we turn
> off buffering so that all possible output is saved to disk before the
> crash?

The log is not nearly detailed enough to tell what routine we're in,
even if there weren't the buffering problem.  Also, given that this is
a kernel crash, I'm not sure I'd assume that even fsync() after every
line of output would ensure that the last line made it to disk.

What you really want is a truss or strace log of kernel calls, anyhow,
but there's still the problem of getting it out to disk before the
crash.  Better find a kernel-debugging expert to ask for advice...

            regards, tom lane
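
[Editorial sketch: the buffering and fsync() point above can be made
concrete. The Python below is an illustration only and is not how the
PostgreSQL backend writes its log; the function name
append_line_durably and the log file name are hypothetical. Each line
is written with an unbuffered os.write() and followed by os.fsync(),
which asks the kernel to push the data to the device. As noted above,
even this may not guarantee that the last line survives a crash inside
the kernel itself.]

```python
import os


def append_line_durably(fd: int, line: str) -> None:
    """Write one log line to fd and ask the kernel to flush it to disk."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # os.write bypasses Python's userspace buffering; loop in case of a short write.
    while data:
        written = os.write(fd, data)
        data = data[written:]
    # fsync requests that the file's data reach the storage device.
    # If the kernel itself crashes, even this may not be enough.
    os.fsync(fd)


if __name__ == "__main__":
    # Hypothetical log file name, for illustration only.
    fd = os.open("debug-trace.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        append_line_durably(fd, "StartTransactionCommand")
        append_line_durably(fd, "ProcessQuery")
    finally:
        os.close(fd)
```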

Re: [HACKERS] Re: Bug#41277: postgresql 6.5.1-3 + sparc (sun4u) == nasty nasty crashes

From: Michael Alan Dorman
Tom Lane <tgl@sss.pgh.pa.us> writes:
> What you really want is a truss or strace log of kernel calls, anyhow,
> but there's still the problem of getting it out to disk before the
> crash.  Better find a kernel-debugging expert to ask for advice...

Serial terminal, or printer or some such hooked up to a serial port.

Mike.
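
[Editorial sketch: one way to combine the two suggestions is to have
strace write its trace straight to the serial device, so nothing
depends on the local disk. The Python below only builds such a command
line. -f (follow forked children), -o (output file) and -p (attach to a
running process) are standard strace options. The device path
/dev/ttyS0, the process ID, the helper name build_strace_command, and
the choice of attaching to the postmaster are all assumptions that
will differ per machine.]

```python
from typing import List


def build_strace_command(output_path: str, pid: int) -> List[str]:
    """Return an strace command line that attaches to pid and its future children."""
    # -f follows processes forked after attaching;
    # -o sends the trace to output_path instead of stderr;
    # -p attaches to an already running process.
    return ["strace", "-f", "-o", output_path, "-p", str(pid)]


if __name__ == "__main__":
    # Assumed serial device and a placeholder process ID (hypothetical values).
    cmd = build_strace_command("/dev/ttyS0", 1234)
    print(" ".join(cmd))
```

Run the printed command as root before starting the regression test; a
terminal or printer on the other end of the serial line then keeps
whatever was emitted up to the crash.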


>> Can postgresql developers tell from this what routine we are in when the
>> crash occurs?  I suppose that log output is buffered; where can we turn
>> off buffering so that all possible output is saved to disk before the
>> crash?
>
>The log is not nearly detailed enough to tell what routine we're in,
>even if there weren't the buffering problem.  Also, given that this is
>a kernel crash, I'm not sure I'd assume that even fsync() after every
>line of output would ensure that the last line made it to disk.
>
>What you really want is a truss or strace log of kernel calls, anyhow,
>but there's still the problem of getting it out to disk before the
>crash.  Better find a kernel-debugging expert to ask for advice...

Hopefully someone from the sparc or sparc64 team at Debian can look
into this.  I am going on business travel for 4 days so will be away
from any Debian/SPARC machines for a while.

These are the questions which need to be answered:

 * Are other people running Debian SPARC seeing the problem when they
follow the recipe I mentioned in my previous email?

 * Is it 2.2.9 specific? Sun4u specific?

 * get strace output as Tom suggests

 * shouldn't we notify the Sparc/Linux folks?
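
[Editorial sketch: to help answer the kernel-version and hardware
questions above, anyone reproducing the crash could attach the values
below to their report. This is a minimal Python illustration using the
standard platform module; the helper name describe_system is
hypothetical, and the fields shown are simply what the running kernel
reports.]

```python
import platform
from typing import Dict


def describe_system() -> Dict[str, str]:
    """Collect the kernel and hardware identifiers relevant to the questions above."""
    info = platform.uname()
    return {
        "system": info.system,    # operating system name, e.g. "Linux"
        "release": info.release,  # kernel version string
        "machine": info.machine,  # architecture name reported by the kernel
    }


if __name__ == "__main__":
    for key, value in describe_system().items():
        print(f"{key}: {value}")
```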

--
.....Adam Di Carlo....adam@onShore.com.....<URL:http://www.onShore.com/>