Thread: BUG #8578: loading a 33G (compressed) pg_dump into a fresh host and db instance crashes a postgresql process

The following bug has been logged on the website:

Bug reference:      8578
Logged by:          Robert Nix
Email address:      robert@urban4m.com
PostgreSQL version: 9.3.1
Operating system:   Linux Ubuntu 12.10
Description:

Below are the logs revealing the failure:


2013-11-05 17:04:21 EST LOG:  server process (PID 5869) was terminated by
signal 9: Killed
2013-11-05 17:04:21 EST DETAIL:  Failed process was running: autovacuum:
ANALYZE data_partitions.zone_municipality_demog_houston
2013-11-05 17:04:21 EST LOG:  terminating any other active server processes
2013-11-05 17:04:21 EST WARNING:  terminating connection because of crash of
another server process
2013-11-05 17:04:21 EST DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-11-05 17:04:21 EST HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2013-11-05 17:04:21 EST WARNING:  terminating connection because of crash of
another server process
2013-11-05 17:04:21 EST DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-11-05 17:04:21 EST HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2013-11-05 17:04:21 EST WARNING:  terminating connection because of crash of
another server process
2013-11-05 17:04:21 EST DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-11-05 17:04:21 EST HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2013-11-05 17:04:21 EST WARNING:  terminating connection because of crash of
another server process
2013-11-05 17:04:21 EST DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit, because another
server process exited abnormally and possibly corrupted shared memory.
2013-11-05 17:04:21 EST HINT:  In a moment you should be able to reconnect
to the database and repeat your command.
2013-11-05 17:04:21 EST CONTEXT:  COPY zone_zipcode_demog_orlando, line
8986
2013-11-05 17:04:23 EST LOG:  all server processes terminated;
reinitializing
2013-11-05 17:04:23 EST LOG:  database system was interrupted; last known up
at 2013-11-05 17:01:22 EST
2013-11-05 17:04:23 EST LOG:  database system was not properly shut down;
automatic recovery in progress
2013-11-05 17:04:23 EST LOG:  redo starts at 57/643FA1F8
2013-11-05 17:04:47 EST LOG:  unexpected pageaddr 56/E5784000 in log segment
00000001000000570000008F, offset 7880704
2013-11-05 17:04:47 EST LOG:  redo done at 57/8F783B68
2013-11-05 17:04:47 EST LOG:  last completed transaction was at log time
2013-11-05 17:01:52.611585-05
2013-11-05 17:04:53 EST LOG:  database system is ready to accept
connections
2013-11-05 17:04:53 EST LOG:  autovacuum launcher started




psql reports:


WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
CONTEXT:  COPY zone_zipcode_demog_orlando, line 8986
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost




The command being issued is:


dropdb u; createdb u; cat u.sql.gz | gzip -d | psql u


u.sql.gz was created as:


pg_dump u | gzip >u.sql.gz


The database from which the dump was created is the same, 9.3.1, same
operating system, both Ubuntu 12.10. Db's have same extensions installed.
The source database shows no signs of issues.
robert@urban4m.com writes:
> 2013-11-05 17:04:21 EST LOG:  server process (PID 5869) was terminated by
> signal 9: Killed

OOM killer, almost certainly.  Check your kernel log to confirm.
Then read about preventing OOM kills:
http://www.postgresql.org/docs/9.3/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

            regards, tom lane
Thanks for the quick response, Tom. Sure enough. I found the OOM message in
syslog. And thanks for the link. I'll try those suggestions. Just out of
curiosity, will using psql -f alter the behavior, in this case, with
respect to memory?
Thanks again


On Tue, Nov 5, 2013 at 8:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> robert@urban4m.com writes:
> > 2013-11-05 17:04:21 EST LOG:  server process (PID 5869) was terminated by
> > signal 9: Killed
>
> OOM killer, almost certainly.  Check your kernel log to confirm.
> Then read about preventing OOM kills:
>
> http://www.postgresql.org/docs/9.3/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
>                         regards, tom lane
>



--
.nix
Robert Nix <robert@urban4m.com> writes:
> Thanks for the quick response, Tom. Sure enough. I found the OOM message in
> syslog. And thanks for the link. I'll try those suggestions. Just out of
> curiosity, will using psql -f alter the behavior, in this case, with
> respect to memory?

No, since the memory consumption was happening on the server side.

It is somewhat of interest why you're apparently getting a memory leak on
the server side, though.  Maybe you were using an unsafely large work_mem
or maintenance_work_mem setting?  I've also seen some reports lately
suggesting that PostGIS might cause intra-query memory leaks, which might
explain this if data_partitions.zone_municipality_demog_houston contains
any PostGIS data types.

            regards, tom lane
re: *work_mem. They are the defaults. In fact, i only changed
checkpoint_segments to 64 because of too frequent checkpointing. All others
are still the defaults.

re: PostGIS. Yes, the table has 3 geometries. Most of the database is
tables with geometries and some are far larger than that one and most of
the largest will have been loaded before that one.

It's running again after changing the sysctl setting and i'm monitoring
memory. Will keep you posted.

Thanks



On Tue, Nov 5, 2013 at 9:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Nix <robert@urban4m.com> writes:
> > Thanks for the quick response, Tom. Sure enough. I found the OOM message
> in
> > syslog. And thanks for the link. I'll try those suggestions. Just out of
> > curiosity, will using psql -f alter the behavior, in this case, with
> > respect to memory?
>
> No, since the memory consumption was happening on the server side.
>
> It is somewhat of interest why you're apparently getting a memory leak on
> the server side, though.  Maybe you were using an unsafely large work_mem
> or maintenance_work_mem setting?  I've also seen some reports lately
> suggesting that PostGIS might cause intra-query memory leaks, which might
> explain this if data_partitions.zone_municipality_demog_houston contains
> any PostGIS data types.
>
>                         regards, tom lane
>



--
.nix
The sysctl change worked. Thanks for helping, Tom. Very much appreciated.


On Tue, Nov 5, 2013 at 10:05 PM, Robert Nix <robert@urban4m.com> wrote:

> re: *work_mem. They are the defaults. In fact, i only changed
> checkpoint_segments to 64 because of too frequent checkpointing. All others
> are still the defaults.
>
> re: PostGIS. Yes, the table has 3 geometries. Most of the database is
> tables with geometries and some are far larger than that one and most of
> the largest will have been loaded before that one.
>
> It's running again after changing the sysctl setting and i'm monitoring
> memory. Will keep you posted.
>
> Thanks
>
>
>
> On Tue, Nov 5, 2013 at 9:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Robert Nix <robert@urban4m.com> writes:
>> > Thanks for the quick response, Tom. Sure enough. I found the OOM
>> message in
>> > syslog. And thanks for the link. I'll try those suggestions. Just out of
>> > curiosity, will using psql -f alter the behavior, in this case, with
>> > respect to memory?
>>
>> No, since the memory consumption was happening on the server side.
>>
>> It is somewhat of interest why you're apparently getting a memory leak on
>> the server side, though.  Maybe you were using an unsafely large work_mem
>> or maintenance_work_mem setting?  I've also seen some reports lately
>> suggesting that PostGIS might cause intra-query memory leaks, which might
>> explain this if data_partitions.zone_municipality_demog_houston contains
>> any PostGIS data types.
>>
>>                         regards, tom lane
>>
>
>
>
> --
> .nix
>



--
.nix