Thread: BUG #2225: Backend crash -- BIG table

BUG #2225: Backend crash -- BIG table

From: "Patrick Rotsaert"
The following bug has been logged online:

Bug reference:      2225
Logged by:          Patrick Rotsaert
Email address:      patrick.rotsaert@arrowup.be
PostgreSQL version: 8.1.2
Operating system:   Linux
Description:        Backend crash -- BIG table
Details:

Situation:
---------
Database with 1 table:
     CREATE TABLE pptran
    (
      cchk int4,
      trid char(14),
      csnr char(13),
      amount int4,
      date date,
      "time" time,
      lane int4,
      "type" int4,
      data char(24),
      stat int4,
      skip int4,
      retry int4,
      points_bc int4,
      points_chip int4,
      trid_tid int2,
      trid_seq int2
    );

This table contains approx. 36 million records!

    CREATE INDEX pptran_trid
      ON pptran
      USING btree (trid);

Problem:
-------
Executing a select query causes the backend to crash. This is the output
from the psql frontend:

    pointspp=# select trid, count(*) from pptran group by trid having count(*) > 1;
    server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    The connection to the server was lost. Attempting reset: Failed.
    !>

This error appears after several minutes. During execution, 'top' shows
about 4% CPU usage and 98% memory usage of the postmaster process.

At the time of the crash, the server logs:
    LOG:  server process (PID 21815) was terminated by signal 9
    LOG:  terminating any other active server processes
    FATAL:  the database system is in recovery mode
    LOG:  all server processes terminated; reinitializing
    LOG:  database system was interrupted at 2006-01-30 15:04:31 CET
    LOG:  checkpoint record is at 3/275944AC
    LOG:  redo record is at 3/275944AC; undo record is at 0/0; shutdown TRUE
    LOG:  next transaction ID: 5415; next OID: 16444
    LOG:  next MultiXactId: 1; next MultiXactOffset: 0
    LOG:  database system was not properly shut down; automatic recovery in progress
    LOG:  record with zero length at 3/275944F0
    LOG:  redo is not required
    LOG:  database system is ready
    LOG:  transaction ID wrap limit is 2147484146, limited by database "postgres"


Platform info:
-------------
- PostgreSQL version 8.1.2
- Linux Slackware 9.1.0, Kernel 2.4.22
- /proc/cpuinfo
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 15
    model           : 2
    model name      : Intel(R) Pentium(R) 4 CPU 2.20GHz
    stepping        : 4
    cpu MHz         : 2192.973
    cache size      : 512 KB
    fdiv_bug        : no
    hlt_bug         : no
    f00f_bug        : no
    coma_bug        : no
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 2
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
    bogomips        : 4377.80
- /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
    Mem:  926220288 33148928 893071360        0  1163264 17268736
    Swap: 2048376832 18296832 2030080000
    MemTotal:       904512 kB
    MemFree:        872140 kB
    MemShared:           0 kB
    Buffers:          1136 kB
    Cached:          15288 kB
    SwapCached:       1576 kB
    Active:           5824 kB
    Inactive:        15820 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:       904512 kB
    LowFree:        872140 kB
    SwapTotal:     2000368 kB
    SwapFree:      1982500 kB

Re: BUG #2225: Backend crash -- BIG table

From: Tom Lane
"Patrick Rotsaert" <patrick.rotsaert@arrowup.be> writes:
> At the time of the crash, the server logs:
>     LOG:  server process (PID 21815) was terminated by signal 9

You're running on a Linux machine with memory overcommit enabled.
Turn that off, or nothing will ever work very reliably --- the OOM
killer is entirely capable of zapping innocent processes that have
nothing to do with the one eating too much memory; and even when it
kills the right process, "kill -9" is not IMHO an acceptable way for
the system to tell a process it can't have any more memory.  See
http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18105

            regards, tom lane

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
on 31/01/2006 16:18 Tom Lane wrote :

>"Patrick Rotsaert" <patrick.rotsaert@arrowup.be> writes:
>
>
>>At the time of the crash, the server logs:
>>    LOG:  server process (PID 21815) was terminated by signal 9
>>
>>
>
>You're running on a Linux machine with memory overcommit enabled.
>Turn that off, or nothing will ever work very reliably --- the OOM
>killer is entirely capable of zapping innocent processes that have
>nothing to do with the one eating too much memory; and even when it
>kills the right process, "kill -9" is not IMHO an acceptable way for
>the system to tell a process it can't have any more memory.  See
>http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18105
>
>            regards, tom lane
>
>
Syslog indeed shows:
    kernel: Out of Memory: Killed process 21815 (postmaster).

Looking at the kernel source mm/mmap.c, the function `int
vm_enough_memory(long pages)' does 1 simple test:
        /* Sometimes we want to use more memory than we have. */
        if (sysctl_overcommit_memory)
            return 1;
But /proc/sys/vm/overcommit_memory  reads `0', so my guess is that
overcommit is not enabled... right?
Any hints?

Thanks,
Patrick Rotsaert

Re: BUG #2225: Backend crash -- BIG table

From: Tom Lane
Patrick Rotsaert <patrick.rotsaert@arrowup.be> writes:
> on 31/01/2006 16:18 Tom Lane wrote :
>> http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18105

> But /proc/sys/vm/overcommit_memory  reads `0', so my guess is that
> overcommit is not enabled... right?

Please read the reference I pointed you to.

            regards, tom lane

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
Tom Lane wrote:

> Patrick Rotsaert <patrick.rotsaert@arrowup.be> writes:
>> on 31/01/2006 16:18 Tom Lane wrote :
>>> http://www.postgresql.org/docs/8.1/static/kernel-resources.html#AEN18105
>
>> But /proc/sys/vm/overcommit_memory  reads `0', so my guess is that
>> overcommit is not enabled... right?
>
> Please read the reference I pointed you to.
>
>            regards, tom lane

I did read it, very carefully. The proposed fix will only work in 2.6
kernels. Mine is a 2.4 and upgrading it is not an option. The document
suggests to look at the kernel source for 2.4 kernels. I did that, as I
wrote in the previous mail. Setting the overcommit parameter to '2', or
any value for that matter, won't do any good because in this kernel, it
is only tested if it is non-zero. On my system, the parameter is 0, so
overcommit is *not* enabled. I don't know what else I can do.
The other proposed option is to install more memory. Sorry, not an
option, 1GB has to be sufficient.

Apart from the overcommit subject, why is postgres consuming so much
memory? Should the solution of this problem not be searched for here?

Thanks,
Patrick Rotsaert

Re: BUG #2225: Backend crash -- BIG table

From: Peter Eisentraut
Patrick Rotsaert wrote:
> On my
> system, the parameter is 0, so overcommit is *not* enabled.

0 means overcommit is enabled.  You want to set it to something other
than 0 to prevent overcommitting and the consequent surprising process
deaths.  Exactly what other values are accepted varies, but 0 isn't the
one for you.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: BUG #2225: Backend crash -- BIG table

From: Bruno Wolff III
On Wed, Feb 01, 2006 at 17:57:50 +0100,
  Patrick Rotsaert <patrick.rotsaert@arrowup.be> wrote:
> I did read it, very carefully. The proposed fix will only work in 2.6
> kernels. Mine is a 2.4 and upgrading it is not an option. The document
> suggests to look at the kernel source for 2.4 kernels. I did that, as I
> wrote in the previous mail. Setting the overcommit parameter to '2', or
> any value for that matter, won't do any good because in this kernel, it
> is only tested if it is non-zero. On my system, the parameter is 0, so
> overcommit is *not* enabled. I don't know what else I can do.
> The other proposed option is to install more memory. Sorry, not an
> option, 1GB has to be sufficient.

Is there some reason you can't add more swap space?

> Apart from the overcommit subject, why is postgres consuming so much
> memory? Should the solution of this problem not be searched for here?

How do you know it is Postgres that is using lots of memory? The OOM killer
doesn't just kill off memory hogs, so you can't just assume the processes
being killed tell you which processes were using lots of memory.

The memory that Postgres uses is controlled in postgresql.conf. One particular
gotcha is that the sort memory setting (work_mem in 8.1, formerly sort_mem) is
per sort, so if you have a number of concurrent sorts you might be using more
memory than you expected.
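
For instance, the per-sort budget can be checked and capped for a single
session before re-running the query (a minimal sketch; the 16384 kB value is
purely illustrative):

    -- current per-sort / per-hash memory budget, in kB on 8.1
    SHOW work_mem;
    -- cap it for this session; each sort or hash step may use up to this much
    SET work_mem = 16384;
    SELECT trid, count(*) FROM pptran GROUP BY trid HAVING count(*) > 1;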

Re: BUG #2225: Backend crash -- BIG table

From: Stephan Szabo
On Mon, 30 Jan 2006, Patrick Rotsaert wrote:

> Problem:
> -------
> Executing a select query causes the backend to crash. This is the output
> from the psql frontend:
>
>     pointspp=# select trid, count(*) from pptran group by trid having count(*) > 1;
>     server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
>     The connection to the server was lost. Attempting reset: Failed.
>     !>

One question is what does the explain (without analyze) plan look like for
the above and are the row estimates valid in the case of one of the hash
plans.

Failing that, how many rows should the above return?

Re: BUG #2225: Backend crash -- BIG table

From: Stephan Szabo
On Fri, 3 Feb 2006, Patrick Rotsaert wrote:

> > One question is what does the explain (without analyze) plan look like for
> > the above and are the row estimates valid in the case of one of the hash
> > plans.
>
> pointspp=# explain select trid, count(*) from pptran group by trid
> having count(*) > 1;
>                                 QUERY PLAN
> --------------------------------------------------------------------------
>  HashAggregate  (cost=1311899.28..1311902.78 rows=200 width=18)
>    Filter: (count(*) > 1)
>    ->  Seq Scan on pptran  (cost=0.00..1039731.02 rows=36289102 width=18)
> (3 rows)

Hmm, if you do an enable_hashagg=false and then run the query (without
explain) does it work then?

> > Failing that, how many rows should the above return?
>
> That is exactly what I am trying to find out. I can only guess that, but
> it should not be more than a couple of 10k rows.

Okay, so it's unlikely to be a client-side memory issue (if it were
returning a lot of rows, that could also be an issue).

Re: BUG #2225: Backend crash -- BIG table

From: Tom Lane
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>> pointspp=# explain select trid, count(*) from pptran group by trid
>> having count(*) > 1;
>> QUERY PLAN
>> --------------------------------------------------------------------------
>> HashAggregate  (cost=1311899.28..1311902.78 rows=200 width=18)
>> Filter: (count(*) > 1)
>> ->  Seq Scan on pptran  (cost=0.00..1039731.02 rows=36289102 width=18)
>> (3 rows)

>>> Failing that, how many rows should the above return?

>> That is exactly what I am trying to find out. I can only guess that, but
>> it should not be more than a couple of 10k rows.

The problem is that the HashAgg will have to maintain a counter for
every distinct value of trid, not just those that occur more than
once.  So if there are a huge number of one-time-only values you could
still blow out memory (and HashAgg doesn't currently know how to spill
to disk).

That "rows=200" estimate looks suspiciously like a default.  Has this
table been ANALYZEd recently?  I'd expect the planner not to choose
HashAgg if it has a more realistic estimate of the number of groups.
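
For instance, once the table has been analyzed, the planner's idea of the
number of trid groups can be checked against the standard pg_stats view (a
minimal sketch; a negative n_distinct means a fraction of the row count):

    ANALYZE pptran;
    SELECT n_distinct
      FROM pg_stats
     WHERE tablename = 'pptran' AND attname = 'trid';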

            regards, tom lane

Re: BUG #2225: Backend crash -- BIG table

From: Tom Lane
Patrick Rotsaert <patrick.rotsaert@arrowup.be> writes:
> I did a vacuum analyze, now the explain gives different results.

> pointspp=# explain select trid, count(*) from pptran group by trid
> having count(*) > 1;
>                                    QUERY PLAN
> --------------------------------------------------------------------------------
>  GroupAggregate  (cost=9842885.29..10840821.57 rows=36288592 width=18)
>    Filter: (count(*) > 1)
>    ->  Sort  (cost=9842885.29..9933606.77 rows=36288592 width=18)
>          Sort Key: trid
>          ->  Seq Scan on pptran  (cost=0.00..1039725.92 rows=36288592 width=18)
> (5 rows)

OK that looks more reasonable.

> pointspp=# select trid, count(*) from pptran group by trid having
> count(*) > 1;
> ERROR:  could not write block 661572 of temporary file: No space left on
> device
> HINT:  Perhaps out of disk space?

> I have 5.1GB of free disk space. If this is the cause, I have a
> problem... or is there another way to extract (and remove) duplicate rows?

Time to buy more disk :-(

            regards, tom lane

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
> Is there some reason you can't add more swap space?
>
Yes, disk space. I have about 2 GB of swap space enabled.

> How do you know it is Postgres that is using lots of memory? The OOM killer
> doesn't just kill off memory hogs, so you can't just assume the processes
> being killed tell you which processes were using lots of memory.
>
I keep 'top' running while I launch the query. One single postmaster
climbs to the top, claiming 98% MEM and about 8% CPU.

> The memory that Postgres uses is controlled in postgresql.conf. One particular
> gotcha is that the sort memory setting (work_mem in 8.1, formerly sort_mem) is
> per sort, so if you have a number of concurrent sorts you might be using more
> memory than you expected.
>
I am very sure there are no other queries running at the same time. This
is a development machine of which I have full control.

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
> One question is what does the explain (without analyze) plan look like for
> the above and are the row estimates valid in the case of one of the hash
> plans.
>
pointspp=# explain select trid, count(*) from pptran group by trid
having count(*) > 1;
                                QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate  (cost=1311899.28..1311902.78 rows=200 width=18)
   Filter: (count(*) > 1)
   ->  Seq Scan on pptran  (cost=0.00..1039731.02 rows=36289102 width=18)
(3 rows)

> Failing that, how many rows should the above return?
>
That is exactly what I am trying to find out. I can only guess that, but
it should not be more than a couple of 10k rows.

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
> 0 means overcommit is enabled.  You want to set it to something other
> than 0 to prevent overcommitting and the consequent surprising process
> deaths.  Exactly what other values are accepted varies, but 0 isn't the
> one for you.
>
I do not understand how 0 could mean overcommit is enabled. I do not
know how it is in recent kernels, but the source code of the 2.4 kernel
I use is this:

int vm_enough_memory(long pages)
{
        unsigned long free;

        /* Sometimes we want to use more memory than we have. */
        if (sysctl_overcommit_memory)
            return 1;

        // ...
}

That seems pretty straightforward to me.
I also did a recursive grep through all of the kernel source and this is
the only place where this parameter is used.
I tried setting the parameter to 1, but it did not make any difference.

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
> Hmm, if you do an enable_hashagg=false and then run the query (without
> explain) does it work then?
>
pointspp=# set enable_hashagg = false;
SET
pointspp=# select trid, count(*) from pptran group by trid having
count(*) > 1;
ERROR:  could not write block 661582 of temporary file: No space left on
device
HINT:  Perhaps out of disk space?

Still does not work, but it no longer consumes the same amount of memory as before.

Re: BUG #2225: Backend crash -- BIG table

From: Patrick Rotsaert
> The problem is that the HashAgg will have to maintain a counter for
> every distinct value of trid, not just those that occur more than
> once.  So if there are a huge number of one-time-only values you could
> still blow out memory (and HashAgg doesn't currently know how to spill
> to disk).
>
One-time-only values are in my case more probable, so it will use a lot
of counters.

>That "rows=200" estimate looks suspiciously like a default.  Has this
>table been ANALYZEd recently?  I'd expect the planner not to choose
>HashAgg if it has a more realistic estimate of the number of groups.
>
>            regards, tom lane
>
>
I did a vacuum analyze, now the explain gives different results.

pointspp=# vacuum analyze;
VACUUM

pointspp=# explain select trid, count(*) from pptran group by trid
having count(*) > 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 GroupAggregate  (cost=9842885.29..10840821.57 rows=36288592 width=18)
   Filter: (count(*) > 1)
   ->  Sort  (cost=9842885.29..9933606.77 rows=36288592 width=18)
         Sort Key: trid
         ->  Seq Scan on pptran  (cost=0.00..1039725.92 rows=36288592 width=18)
(5 rows)

pointspp=# select trid, count(*) from pptran group by trid having
count(*) > 1;
ERROR:  could not write block 661572 of temporary file: No space left on
device
HINT:  Perhaps out of disk space?

I have 5.1GB of free disk space. If this is the cause, I have a
problem... or is there another way to extract (and remove) duplicate rows?

Re: BUG #2225: Backend crash -- BIG table

From: Bruno Wolff III
On Fri, Feb 03, 2006 at 19:38:04 +0100,
  Patrick Rotsaert <patrick.rotsaert@arrowup.be> wrote:
>
> I have 5.1GB of free disk space. If this is the cause, I have a
> problem... or is there another way to extract (and remove) duplicate rows?

How about processing a subset of the ids in one pass and then making
multiple passes to check all of the ids? As long as you don't have to use
chunks that are too small, this might work for you.
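
A minimal sketch of that idea, assuming the trid key space can be split on its
leading character (the range boundaries below are purely illustrative and would
need to be chosen to fit the data):

    -- one pass over a slice of the key space; repeat with the next range
    SELECT trid, count(*)
      FROM pptran
     WHERE trid >= '0' AND trid < '5'
     GROUP BY trid
    HAVING count(*) > 1;

Each pass only has to sort a fraction of the 36 million rows, so its temporary
files stay correspondingly smaller, and the duplicates found in each pass could
be accumulated in a side table with INSERT ... SELECT and dealt with afterwards.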