Thread: BUG #1015: Got a signal 11 while trying to create a temp table

BUG #1015: Got a signal 11 while trying to create a temp table

From
"PostgreSQL Bugs List"
Date:
The following bug has been logged online:

Bug reference:      1015
Logged by:          Aarjan Langereis
Email address:      A.j.langereis@chello.nl
PostgreSQL version: 7.3.4
Operating system:   RedHat Linux 9 kernel 2.4.20-20.9
Description:        Got a signal 11 while trying to create a temp table
Details:

I tried to create a temp table and got my back-end restarting because of a signal 11.

The 3 tables involved:

CREATE TABLE hosts (
        hostID serial primary key,
        hostip cidr NOT NULL,
        hostname varchar(50),
        lastseen timestamp without time zone default ('1970-01-01 01:00'),
        total integer default 0,
        image varchar(20) default 'hosts/unknown.png'
);
CREATE TABLE cpus (
        cpuID integer primary key,
        cpuname varchar(20),
        lastseen timestamp without time zone default ('1970-01-01 01:00'),
        total integer default 0,
        image varchar(20) default 'cpus/unknown.png'
);
CREATE TABLE blocks (
        blockID varchar(30) primary key,
        blockdate timestamp without time zone NOT NULL,
        hostID integer REFERENCES hosts,
        orgIP cidr NOT NULL,
        email varchar(30) NOT NULL,
        osID integer  NOT NULL,
        cpuID integer NOT NULL,
        version integer NOT NULL,
        core integer NOT NULL,
        amount integer NOT NULL
);

Hosts has 205 rows
Cpus has 17 rows
And blocks has 3194409 rows

This is the problem query:

    Create TEMP table tmphosts AS
      select hosts.hostid, hosts.hostip, hosts.hostname, max(blockdate) as lastseen, sum(amount) as total, hosts.image
        from hosts left join blocks on hosts.hostid=blocks.hostid
        group by hosts.hostid, hosts.hostip, hosts.hostname, hosts.image;

But even without the first line is does not work. However this query does work properly:

    Create TEMP table tmpcpus AS
      select cpus.cpuid, cpuname, max(blockdate) as lastseen, sum(amount) as total, image
       from cpus left join blocks on cpus.cpuid=blocks.cpuid
       group by cpus.cpuid, cpuname, image;
They look rather the same to me… But with the first one I got this error in psql:

server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!#

In the log was this:

LOG:  server process (pid 27196) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing shared memory and semaphores
FATAL:  The database system is starting up
LOG:  database system was interrupted at 2003-12-18 19:16:21 CET
LOG:  checkpoint record is at 6/9312CD40
LOG:  redo record is at 6/9312CD40; undo record is at 0/0; shutdown FALSE
LOG:  next transaction id: 28888909; next oid: 15667926
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 6/9312CD80
LOG:  ReadRecord: unexpected pageaddr 6/8B162000 in log file 6, segment 147, offset 1449984
LOG:  redo done at 6/9315EE4C
LOG:  database system is ready

I don’t know what information can be useful to you. But if you need more, please ask!

It seems to me, and please correct me if I’m wrong, that there is a limit to the size that a join can handle.

I hope that the information provided is of any use to you.

Yours,

Aarjan

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
Tom Lane
Date:
"PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:
> I tried to create a temp table and got my back-end restarting because of a signal 11.

Hmm.  Can you get a debugger backtrace from the core dump?

> It seems to me, and please correct me if I’m wrong, that there is a limit to the size that a join can handle.

No (and certainly not on a measly 3-million-row case).  This could be a
data corruption problem, or something more subtle, but it's not that.

One way of testing the data-corruption theory is to see if you can
select all the data from the tables involved, without any join.

            regards, tom lane

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
"aarjan langereis"
Date:
How do I get a "debugger backtrace" ?

Selecting all data from the tables involved, does that also include a 'coun=
t(*)', if so, they work:

stats=3D# select count(*) from blocks;
  count
---------
 3194409
(1 row)

stats=3D# select count(*) from hosts;
 count
-------
   205
(1 row)

stats=3D#

Yours,

Aarjan

----- Original Message -----=20
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: <A.j.langereis@chello.nl>
Cc: "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org>
Sent: Friday, December 19, 2003 4:41 AM
Subject: Re: [BUGS] BUG #1015: Got a signal 11 while trying to create a tem=
p table=20


> "PostgreSQL Bugs List" <pgsql-bugs@postgresql.org> writes:
> > I tried to create a temp table and got my back-end restarting because o=
f a signal 11.
>=20
> Hmm.  Can you get a debugger backtrace from the core dump?
>=20
> > It seems to me, and please correct me if I=E2?Tm wrong, that there is a=
 limit to the size that a join can handle.
>=20
> No (and certainly not on a measly 3-million-row case).  This could be a
> data corruption problem, or something more subtle, but it's not that.
>=20
> One way of testing the data-corruption theory is to see if you can
> select all the data from the tables involved, without any join.
>=20
> regards, tom lane
>

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
Tom Lane
Date:
"aarjan langereis" <a.j.langereis@inter.nl.net> writes:
> How do I get a "debugger backtrace" ?

Find the "core" file left by the crashed backend --- it should be in
$PGDATA/base/yourdbnumber/core and have a file date equal to the time
of the crash.  If you don't find one, it's likely that the postmaster
was started under "ulimit -c 0" which prevents core dumps.  Restart
the postmaster under "ulimit -c unlimited" and reproduce the crash.

Once you have the core file, do
    $ gdb /path/to/postgres-executable /path/to/core-file
    gdb> bt
    gdb> quit

If bt just produces a list of numbers without any names, it's not going
to be helpful.  In that case you need to rebuild Postgres with debugging
symbols and start over.

There is more info in the archives.


> Selecting all data from the tables involved, does that also include a 'coun=
> t(*)', if so, they work:

Mmm, that really only proves that the page headers and tuple headers are
OK, not that there is not data corruption within some row, because
COUNT(*) won't try to extract any field values from any rows.  I'd
suggest a SELECT * or COPY TO FILE operation to check whether there is
any data corruption.

            regards, tom lane

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
"aarjan langereis"
Date:
I'm using a RedHat machine. In my /etc/init.d/postgresql is this the line
that statsup the postmaster:

su -l postgres -s /bin/sh -c "/usr/bin/pg_ctl -D $PGDATA -p
/usr/bin/postmaster -o '-p ${PGPORT}' start  > /var/log/pgsql.log 2>&1" <
/dev/null

I don't see that "ulimit -c 0".. nowhere in the script. Where do I have to
put the "ulimit -c unlimited"?
Because he is not dumping any core now...

In my first mail I spoke about 2 query's that where rather the same, both
use the 'blocks'-table (3.2M records) and joint it with another one.
One join was with the 'cpus'-table (17 records), this one worked perfect.
The other join was with the 'hosts'-table (205 records), the problem query.
From that and a full look het my 'hosts'-table I can conclude that the data
in the 'blocks'-table is not currupt..

Ok, id some other tests too:

Select * from blocks; gave me the whole table (I didn't look at all records,
but got a result in psql)
select hostid, sum(amount) from blocks group by hostid; crashed (3,2M
records used)
select hostid, sum(amount) from blocks where blockdate::date between
'2003-01-01' and '2003-02-01' group by hostid; worked (also for all other
months!!). (200K- 450K records used)
select hostid, sum(amount) from blocks where blockdate::date between
'2003-01-01' and '2003-07-01' group by hostid; crashed (1390618 records
used)
select hostid, sum(amount) from blocks where blockdate::date between
'2003-02-01' and '2003-07-01' group by hostid; worked (1202952 records used)

To me it seems to be the size of it all...

Yours,

Aarjan

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "aarjan langereis" <a.j.langereis@inter.nl.net>
Cc: <pgsql-bugs@postgresql.org>
Sent: Saturday, December 20, 2003 6:27 PM
Subject: Re: [BUGS] BUG #1015: Got a signal 11 while trying to create a temp
table


> "aarjan langereis" <a.j.langereis@inter.nl.net> writes:
> > How do I get a "debugger backtrace" ?
>
> Find the "core" file left by the crashed backend --- it should be in
> $PGDATA/base/yourdbnumber/core and have a file date equal to the time
> of the crash.  If you don't find one, it's likely that the postmaster
> was started under "ulimit -c 0" which prevents core dumps.  Restart
> the postmaster under "ulimit -c unlimited" and reproduce the crash.
>
> Once you have the core file, do
> $ gdb /path/to/postgres-executable /path/to/core-file
> gdb> bt
> gdb> quit
>
> If bt just produces a list of numbers without any names, it's not going
> to be helpful.  In that case you need to rebuild Postgres with debugging
> symbols and start over.
>
> There is more info in the archives.
>
>
> > Selecting all data from the tables involved, does that also include a
'coun=
> > t(*)', if so, they work:
>
> Mmm, that really only proves that the page headers and tuple headers are
> OK, not that there is not data corruption within some row, because
> COUNT(*) won't try to extract any field values from any rows.  I'd
> suggest a SELECT * or COPY TO FILE operation to check whether there is
> any data corruption.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
Tom Lane
Date:
"aarjan langereis" <a.j.langereis@chello.nl> writes:
> I don't see that "ulimit -c 0".. nowhere in the script. Where do I have to
> put the "ulimit -c unlimited"?

I'd suggest adding it to the pg_ctl script.

> Select * from blocks; gave me the whole table (I didn't look at all records,
> but got a result in psql)
> select hostid, sum(amount) from blocks group by hostid; crashed (3,2M
> records used)

Hm.  What do you have sort_mem set to?  Does the EXPLAIN output show
different plans for these queries?

            regards, tom lane

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
"aarjan langereis"
Date:
Ok, just now I understand that 'ulimit' is a bashbuiltin :P. So I fixed that
;)
I reproduced the crash and got a good core-file (with the symbols, see
below). The only thing is: it's 390Mb :( .. even after gzip it's big: 250Mb.
Would that be a problem for you? (I have the bandwith if you want)

(gdb) bt
#0  0x0819b684 in LogicalTapeWrite ()
#1  0x0819da13 in ApplySortFunction ()
#2  0x0819cf10 in tuplesort_getdatum ()
#3  0x080ea656 in ExecSort ()
#4  0x080e07ae in ExecProcNode ()
#5  0x080eaf6f in ExecGroup ()
#6  0x080e07ef in ExecProcNode ()
#7  0x080e5302 in ExecAgg ()
#8  0x080e07bb in ExecProcNode ()
#9  0x080df3a5 in ExecutorEnd ()
#10 0x080de8ac in ExecutorRun ()
#11 0x0813c4bd in ProcessQuery ()
#12 0x0813a898 in pg_exec_query_string ()
#13 0x0813b93f in PostgresMain ()
#14 0x0811f37b in ClosePostmasterPorts ()
#15 0x0811ee03 in ClosePostmasterPorts ()
#16 0x0811dc36 in PostmasterMain ()
#17 0x0811d3a6 in PostmasterMain ()
#18 0x080f5113 in main ()
#19 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
(gdb)


These settings are used:
/proc/sys/kernel/shmmax =3D 500000000
shared_buffers =3D 33000
sort_mem =3D 98304
fsync =3D false

The explain's of both query's:

stats=3D# explain select * from blocks;
                                QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on blocks  (cost=3D100000000.00..100000020.00 rows=3D1000 width=
=3D130)
(1 row)

stats=3D# explain select hostid, sum(amount) from blocks group by hostid;
                                        QUERY PLAN
----------------------------------------------------------------------------
---------------
 Aggregate  (cost=3D100000069.83..100000077.33 rows=3D100 width=3D8)
   ->  Group  (cost=3D100000069.83..100000074.83 rows=3D1000 width=3D8)
         ->  Sort  (cost=3D100000069.83..100000072.33 rows=3D1000 width=3D8)
               Sort Key: hostid
               ->  Seq Scan on blocks  (cost=3D100000000.00..100000020.00
rows=3D1000 width=3D8)
(5 rows)

stats=3D#

I hope this gives you some more info,

Aarjan

----- Original Message -----=20
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "aarjan langereis" <a.j.langereis@chello.nl>
Cc: <pgsql-bugs@postgresql.org>
Sent: Sunday, December 21, 2003 6:13 PM
Subject: Re: [BUGS] BUG #1015: Got a signal 11 while trying to create a temp
table


> "aarjan langereis" <a.j.langereis@chello.nl> writes:
> > I don't see that "ulimit -c 0".. nowhere in the script. Where do I have
to
> > put the "ulimit -c unlimited"?
>
> I'd suggest adding it to the pg_ctl script.
>
> > Select * from blocks; gave me the whole table (I didn't look at all
records,
> > but got a result in psql)
> > select hostid, sum(amount) from blocks group by hostid; crashed (3,2M
> > records used)
>
> Hm.  What do you have sort_mem set to?  Does the EXPLAIN output show
> different plans for these queries?
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
>       joining column's datatypes do not match
>

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
"aarjan langereis"
Date:
Sorry I made a mistake , the coredump is only 25 Mb :D

Aarjan
  ----- Original Message -----=20
  From: aarjan langereis=20
  To: pgsql-bugs@postgresql.org=20
  Sent: Wednesday, December 24, 2003 10:49 AM
  Subject: Re: [BUGS] BUG #1015: Got a signal 11 while trying to create a t=
emp table


  Ok, just now I understand that 'ulimit' is a bashbuiltin :P. So I fixed t=
hat
  ;)
  I reproduced the crash and got a good core-file (with the symbols, see
  below). The only thing is: it's 390Mb :( .. even after gzip it's big: 250=
Mb.
  Would that be a problem for you? (I have the bandwith if you want)

  (gdb) bt
  #0  0x0819b684 in LogicalTapeWrite ()
  #1  0x0819da13 in ApplySortFunction ()
  #2  0x0819cf10 in tuplesort_getdatum ()
  #3  0x080ea656 in ExecSort ()
  #4  0x080e07ae in ExecProcNode ()
  #5  0x080eaf6f in ExecGroup ()
  #6  0x080e07ef in ExecProcNode ()
  #7  0x080e5302 in ExecAgg ()
  #8  0x080e07bb in ExecProcNode ()
  #9  0x080df3a5 in ExecutorEnd ()
  #10 0x080de8ac in ExecutorRun ()
  #11 0x0813c4bd in ProcessQuery ()
  #12 0x0813a898 in pg_exec_query_string ()
  #13 0x0813b93f in PostgresMain ()
  #14 0x0811f37b in ClosePostmasterPorts ()
  #15 0x0811ee03 in ClosePostmasterPorts ()
  #16 0x0811dc36 in PostmasterMain ()
  #17 0x0811d3a6 in PostmasterMain ()
  #18 0x080f5113 in main ()
  #19 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
  (gdb)


  These settings are used:
  /proc/sys/kernel/shmmax =3D 500000000
  shared_buffers =3D 33000
  sort_mem =3D 98304
  fsync =3D false

  The explain's of both query's:

  stats=3D# explain select * from blocks;
                                  QUERY PLAN
  -------------------------------------------------------------------------=
--
   Seq Scan on blocks  (cost=3D100000000.00..100000020.00 rows=3D1000 width=
=3D130)
  (1 row)

  stats=3D# explain select hostid, sum(amount) from blocks group by hostid;
                                          QUERY PLAN
  -------------------------------------------------------------------------=
---
  ---------------
   Aggregate  (cost=3D100000069.83..100000077.33 rows=3D100 width=3D8)
     ->  Group  (cost=3D100000069.83..100000074.83 rows=3D1000 width=3D8)
           ->  Sort  (cost=3D100000069.83..100000072.33 rows=3D1000 width=
=3D8)
                 Sort Key: hostid
                 ->  Seq Scan on blocks  (cost=3D100000000.00..100000020.00
  rows=3D1000 width=3D8)
  (5 rows)

  stats=3D#

  I hope this gives you some more info,

  Aarjan

  ----- Original Message -----=20
  From: "Tom Lane" <tgl@sss.pgh.pa.us>
  To: "aarjan langereis" <a.j.langereis@chello.nl>
  Cc: <pgsql-bugs@postgresql.org>
  Sent: Sunday, December 21, 2003 6:13 PM
  Subject: Re: [BUGS] BUG #1015: Got a signal 11 while trying to create a t=
emp
  table


  > "aarjan langereis" <a.j.langereis@chello.nl> writes:
  > > I don't see that "ulimit -c 0".. nowhere in the script. Where do I ha=
ve
  to
  > > put the "ulimit -c unlimited"?
  >
  > I'd suggest adding it to the pg_ctl script.
  >
  > > Select * from blocks; gave me the whole table (I didn't look at all
  records,
  > > but got a result in psql)
  > > select hostid, sum(amount) from blocks group by hostid; crashed (3,2M
  > > records used)
  >
  > Hm.  What do you have sort_mem set to?  Does the EXPLAIN output show
  > different plans for these queries?
  >
  > regards, tom lane
  >
  > ---------------------------(end of broadcast)---------------------------
  > TIP 9: the planner will ignore your desire to choose an index scan if y=
our
  >       joining column's datatypes do not match
  >

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
Tom Lane
Date:
"aarjan langereis" <a.j.langereis@chello.nl> writes:
> I reproduced the crash and got a good core-file (with the symbols, see
> below). The only thing is: it's 390Mb :( .. even after gzip it's big: 250Mb.
> Would that be a problem for you? (I have the bandwith if you want)

No point in sending it to me, it's of no value on a different machine
(because the executable files are usually not identical).

> (gdb) bt
> #0  0x0819b684 in LogicalTapeWrite ()
> #1  0x0819da13 in ApplySortFunction ()
> #2  0x0819cf10 in tuplesort_getdatum ()
> #3  0x080ea656 in ExecSort ()

Hm.  I'm afraid gdb is lying to you, because this stack trace is
impossible -- those functions don't call each other.  I have seen
that happen when the compiler optimization level is too high; gdb's
stack tracing code gets confused.  To get any useful information,
you'll need to rebuild with debug symbols enabled (--enable-debug
switch to configure).

            regards, tom lane

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
"aarjan langereis"
Date:
I've got gdb from the RPM's from redhat 9. Whould that RPM be wrong then?

Or do you mean that I have to recompile the database?

Aarjan

Ps. Merry Christmas!

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "aarjan langereis" <a.j.langereis@chello.nl>
Cc: <pgsql-bugs@postgresql.org>
Sent: Wednesday, December 24, 2003 4:15 PM
Subject: Re: [BUGS] BUG #1015: Got a signal 11 while trying to create a temp
table


> "aarjan langereis" <a.j.langereis@chello.nl> writes:
> > I reproduced the crash and got a good core-file (with the symbols, see
> > below). The only thing is: it's 390Mb :( .. even after gzip it's big:
250Mb.
> > Would that be a problem for you? (I have the bandwith if you want)
>
> No point in sending it to me, it's of no value on a different machine
> (because the executable files are usually not identical).
>
> > (gdb) bt
> > #0  0x0819b684 in LogicalTapeWrite ()
> > #1  0x0819da13 in ApplySortFunction ()
> > #2  0x0819cf10 in tuplesort_getdatum ()
> > #3  0x080ea656 in ExecSort ()
>
> Hm.  I'm afraid gdb is lying to you, because this stack trace is
> impossible -- those functions don't call each other.  I have seen
> that happen when the compiler optimization level is too high; gdb's
> stack tracing code gets confused.  To get any useful information,
> you'll need to rebuild with debug symbols enabled (--enable-debug
> switch to configure).
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

Re: BUG #1015: Got a signal 11 while trying to create a temp table

From
Tom Lane
Date:
"aarjan langereis" <a.j.langereis@chello.nl> writes:
> I've got gdb from the RPM's from redhat 9. Whould that RPM be wrong then?
> Or do you mean that I have to recompile the database?

You need to recompile Postgres.  If you are using an RPM distribution of
Postgres, I think you can get the source RPM (SRPM) and rebuild it with
a debug option, but I don't know the details of that process.  If you
built Postgres yourself, then it's just a matter of adding
--enable-debug to your configure switches and rebuilding.

            regards, tom lane