Thread: BUG #2858: postgres periodically restarts (problem with MemoryContextAllocZeroAligned)...

The following bug has been logged online:

Bug reference:      2858
Logged by:          Robert Locke
Email address:      rob@mobius.ph
PostgreSQL version: 8.1.4
Operating system:   FreeBSD 6.1-RELEASE-p6
Description:        postgres periodically restarts (problem with
MemoryContextAllocZeroAligned)...
Details:

We recently began experiencing a problem with postgres where the server
would periodically restart, with messages such as the following in the log
file:

Dec 22 14:15:56 MOv2DB postgres[38675]: [100-1] WARNING:  terminating
connection because of crash of another server process
Dec 22 14:15:56 MOv2DB postgres[38675]: [100-2] DETAIL:  The postmaster has
commanded this server process to roll back the current transaction and exit,
because another server
Dec 22 14:15:56 MOv2DB postgres[38675]: [100-3]  process exited abnormally
and possibly corrupted shared memory.
Dec 22 14:15:56 MOv2DB postgres[38675]: [100-4] HINT:  In a moment you
should be able to reconnect to the database and repeat your command.

"dmesg" would reveal errors such as:

pid 34866 (postgres), uid 70: exited on signal 11 (core dumped)
pid 43893 (postgres), uid 70: exited on signal 11 (core dumped)
pid 43907 (postgres), uid 70: exited on signal 11 (core dumped)
pid 46337 (postgres), uid 70: exited on signal 11 (core dumped)

We enabled query logging and found that the process would sometimes die when
a function called "removeAccount" was executed:

46337 2006-12-22 14:21:56 PHT 10.48.14.246 LOG:  statement: SELECT * FROM
core."removeAccount"(5130175)
45166 2006-12-22 14:21:59 PHT  LOG:  server process (PID 46337) was
terminated by signal 11

This function simply executes a number of delete statements to remove a user
from the system.  We discovered, however, that it was a little slow (3 to 4
seconds) because the final delete removes the record from a table that is
referenced by foreign keys in a number of other tables.

Adding a couple of indexes greatly improved the performance of the function,
and the crashes have not recurred since.  However, we are concerned that this
might indicate a more serious problem in Postgres that could cause further
issues down the road.
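
For illustration only, here is a minimal sketch of the kind of index that
speeds up a delete from a table referenced by foreign keys.  The table and
column names (account, account_log, account_id) are hypothetical stand-ins,
not the real objects touched by core."removeAccount":

```sql
-- Hypothetical schema: names are illustrative, not the reporter's tables.
CREATE TABLE account (
    id integer PRIMARY KEY
);

CREATE TABLE account_log (
    id         serial PRIMARY KEY,
    account_id integer NOT NULL REFERENCES account (id),
    message    text
);

-- PostgreSQL indexes the referenced column (account.id) via its primary key,
-- but it does not automatically index the referencing column.  Without the
-- index below, the foreign-key check fired by each delete from "account"
-- has to scan account_log sequentially.
CREATE INDEX account_log_account_id_idx ON account_log (account_id);

-- With the index in place, the check for this delete can use an index lookup.
DELETE FROM account WHERE id = 5130175;
```

One such index would be needed for each referencing table.  Whether the
planner actually uses it depends on table sizes and statistics.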

Here's a backtrace from the core dump for reference:

#0  0x08079d7f in heap_modifytuple ()
#1  0x08079eb6 in slot_getattr ()
#2  0x0816344d in ExecMakeFunctionResult ()
#3  0x081675b7 in ExecQual ()
#4  0x08167bae in ExecScan ()
#5  0x08175547 in ExecSeqScan ()
#6  0x08161b52 in ExecProcNode ()
#7  0x08160a8e in ExecutorRun ()
#8  0x0817ae0f in spi_printtup ()
#9  0x0817b9b0 in SPI_execute_snapshot ()
#10 0x00000000 in ?? ()
#11 0x00000000 in ?? ()
#12 0x00000000 in ?? ()
#13 0x00000001 in ?? ()
#14 0x083ecc88 in ?? ()
#15 0xbfbfa3e8 in ?? ()
#16 0x0000000a in ?? ()
#17 0x08607018 in ?? ()
#18 0x00000001 in ?? ()
#19 0xbfbfa588 in ?? ()
#20 0x08299fa9 in RI_Initial_Check ()
#21 0x083ecad8 in ?? ()
#22 0xbfbfa458 in ?? ()
#23 0x082e08ca in MemoryContextAllocZeroAligned ()
Previous frame inner to this frame (corrupt stack?)

Any ideas?