NFS vs. PostgreSQL on Solaris

From: "Thomas F. O'Connell"

I've got an interesting case study that I don't fully understand from a postgres perspective, and I'm hoping that someone in the community might help me understand what led to the outcome and whether it's easily prevented or not.

The setup: PostgreSQL 8.2.3 on Solaris 10 x86

postgresql.conf:
shared_buffers = 200000
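
(For context: with the default 8 kB block size, shared_buffers = 200000 works
out to roughly 200000 x 8 kB, i.e. about 1.6 GB of shared memory, assuming the
build uses the standard block size.)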

This box had been completely dedicated to postgres. A new project required that it also start sharing a directory over NFS. In preparation for this, nfsd had been running for quite some time, although today was the first day that production mount points were added on other boxes. The first external mount point was created this morning, and all seemed quiet. The second was added this afternoon, and that second mount point caused postgres great consternation.

Within a few seconds of enabling the second external NFS mount point, postgres began reporting out-of-memory errors. An INSERT generated the following:

TopMemoryContext: 118832 total in 10 blocks; 17512 free (24 chunks); 101320 used
TopTransactionContext: 8192 total in 1 blocks; 7856 free (0 chunks); 336 used
Type information cache: 8192 total in 1 blocks; 1800 free (0 chunks); 6392 used
Operator class cache: 8192 total in 1 blocks; 4872 free (0 chunks); 3320 used
MessageContext: 262144 total in 6 blocks; 75208 free (5 chunks); 186936 used
smgr relation table: 8192 total in 1 blocks; 1776 free (0 chunks); 6416 used
TransactionAbortContext: 32768 total in 1 blocks; 32752 free (0 chunks); 16 used
Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
PortalHeapMemory: 1024 total in 1 blocks; 400 free (0 chunks); 624 used
ExecutorState: 147648 total in 7 blocks; 59816 free (11 chunks); 87832 used
HashTableContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
HashBatchContext: 25174100 total in 13 blocks; 7754600 free (16 chunks); 17419500 used
TupleSort: 4448280 total in 10 blocks; 1888 free (25 chunks); 4446392 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Unique: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
AggContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 8192 total in 1 blocks; 2856 free (0 chunks); 5336 used
CacheMemoryContext: 659000 total in 19 blocks; 140248 free (1 chunks); 518752 used

This was followed by a number of statements about individual relations and, eventually, this:

MdSmgr: 8192 total in 1 blocks; 7120 free (0 chunks); 1072 used
LOCALLOCK hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
2007-04-26 14:47:42 CDT 15175 :LOG:  could not fork new process for connection: Not enough space
Timezones: 48616 total in 2 blocks; 5968 free (0 chunks); 42648 used
ErrorContext: 8192 total in 1 blocks; 8176 free (4 chunks); 16 used
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):ERROR:  out of memory
2007-04-26 14:47:42 CDT 17018 10.0.4.168(46723):DETAIL:  Failed on request of size 257.

This happened intermittently for the next several minutes, eventually escalating to a FATAL error:

2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):FATAL:  out of memory
2007-04-26 14:50:12 CDT 18421 10.0.4.168(44414):DETAIL:  Failed on request of size 1864.

The FATAL error was preceded by a memory-context report similar to the one that accompanied the original ERRORs. I don't know what distinguishes the ERROR version of "out of memory" from the FATAL version.

Then, eventually, without any evidence of other things on the box behaving badly, we saw this:

2007-04-26 14:53:05 CDT 15175 :LOG:  server process (PID 18543) was terminated by signal 11
2007-04-26 14:53:05 CDT 15175 :LOG:  terminating any other active server processes

So my questions are:

1. What aspect of postgres' memory usage would create an "out of memory" condition?
2. What is the difference between an ERROR and a FATAL "out of memory" message?
3. What would cause postgres to die from a signal 11?

I've also got a core file if that's necessary for further forensics.

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

615-260-0005

Re: NFS vs. PostgreSQL on Solaris

From: Tom Lane

"Thomas F. O'Connell" <tf@o.ptimized.com> writes:
> 1. What aspect of postgres' memory usage would create an "out of
> memory" condition?

I'm guessing you ran the box out of swap space --- look into what other
processes got started as a result of adding the NFS mount, and how much
memory they wanted to eat.
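
For what it's worth, a rough sketch of that check on Solaris 10, assuming the
standard tools are in the path (flags from memory, adjust as needed):

    swap -s                    # summary of allocated / reserved / available swap
    prstat -s rss -n 20        # top 20 processes by resident set size
    ps -eo pid,vsz,rss,args | sort -rn -k 2 | head -20    # largest virtual sizes

Comparing the output from before and after the mount is added should show
which new processes are reserving the space.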

> 3. What would cause postgres to die from a signal 11?
> I've also got a core file if that's necessary for further forensics.

Send gdb backtrace, please.

            regards, tom lane

Re: NFS vs. PostgreSQL on Solaris

From: "Thomas F. O'Connell"

On Apr 26, 2007, at 6:51 PM, Tom Lane wrote:

> "Thomas F. O'Connell" <tf@o.ptimized.com> writes:
>> 1. What aspect of postgres' memory usage would create an "out of
>> memory" condition?
>
> I'm guessing you ran the box out of swap space --- look into what other
> processes got started as a result of adding the NFS mount, and how much
> memory they wanted to eat.
>
>> 3. What would cause postgres to die from a signal 11?
>> I've also got a core file if that's necessary for further forensics.
>
> Send gdb backtrace, please.
>
>             regards, tom lane

Unfortunately, the production build in question is lacking --enable-debug. :(

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance :

http://o.ptimized.com/
615-260-0005


Re: NFS vs. PostgreSQL on Solaris

From: Tom Lane

"Thomas F. O'Connell" <tf@o.ptimized.com> writes:
> On Apr 26, 2007, at 6:51 PM, Tom Lane wrote:
>> "Thomas F. O'Connell" <tf@o.ptimized.com> writes:
>>> 3. What would cause postgres to die from a signal 11?
>>> I've also got a core file if that's necessary for further forensics.
>>
>> Send gdb backtrace, please.

> Unfortunately, the production build in question is lacking --enable-debug. :(

Well, if it wasn't actually stripped then gdb could still get function
names out of it, which might or might not be enough but it's sure more
info than you provided so far.
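
For example, assuming the core file and the installed binary are in their
usual places (the paths here are just placeholders):

    gdb /usr/local/pgsql/bin/postgres /path/to/core
    (gdb) bt

A non-stripped binary should at least yield function names in the backtrace,
even without full debug info.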

If you built with gcc, then a possible plan B is to recompile with all
the same options plus --enable-debug, and hope that the resulting
executables are bit-for-bit the same except for addition of debug
symbols, so you could use them with the corefile.  This theoretically
should work, if nothing has changed in your build environment, though
that assumption is obviously a bit shaky.
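
A minimal sketch of that, assuming gcc and a source tree matching the
installed version (pg_config --configure will echo the original options):

    pg_config --configure                     # options used for the original build
    ./configure <same options as reported> --enable-debug
    gmake
    gdb src/backend/postgres /path/to/core    # use the new binary with the old core

As noted, this only works if the compiler, options, and sources are identical
to the original build.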

            regards, tom lane