Re: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create(). - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().
Date
Msg-id CAEepm=1ySPpiPG8hyy98vTGbvi6uXvavsy-zahevkM=tTRUDSQ@mail.gmail.com
In response to Re: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, Aug 13, 2016 at 8:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sat, Aug 13, 2016 at 2:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> amul sul <sul_amul@yahoo.co.in> writes:
>>> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when
>>> later try to access the memory.  This happens because of my system does not have enough shm space &  current allocation
>>> in dsm_impl_posix does not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file)
>>> to ensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory
>>> mapping. But here we will end up by loop calling ‘write’ squillions of times.
>>
>> Wouldn't that just result in a segfault during dsm_create?
>>
>> I think probably what you are describing here is kernel misbehavior
>> akin to memory overcommit.  Maybe it *is* memory overcommit and can
>> be turned off the same way.  If not, you have material for a kernel
>> bug fix/enhancement request.
>
> [...] But it
> looks like if we used fallocate or posix_fallocate in the
> dsm_impl_posix case we'd get a nice ENOSPC error, instead of
> success-but-later-SIGBUS-on-access.

Here's a simple test extension that creates jumbo dsm segments, and
then accesses all pages.  If you ask it to write cheques that your
Linux 3.10 machine can't cash on unpatched master, it does this:

postgres=# create extension foo;
CREATE EXTENSION
postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
server closed the connection unexpectedly
...
LOG:  server process (PID 15105) was terminated by signal 7: Bus error

If I apply the attached experimental patch I get:

postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
ERROR:  could not resize shared memory segment
"/PostgreSQL.1938734921" to 17179869184 bytes: No space left on device

It should probably be refactored a bit to separate the error messages
for ftruncate and posix_fallocate, and we could possibly use the same
approach for dsm_impl_mmap instead of that write() loop, but this at
least demonstrates the problem Amul reported.  Thoughts?
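
For anyone reading without the attachment, here is a rough standalone
sketch of the idea, not the patch itself.  The helper name
create_reserved_shm is made up for illustration and none of this is
PostgreSQL code; it only shows ftruncate followed by posix_fallocate on
a POSIX shared memory object, so that a shortage of shm space is
reported at creation time rather than on first access.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (an illustration, not PostgreSQL code): create a
 * POSIX shared memory object of the given size and ask the kernel to
 * reserve its backing space immediately.
 *
 * Returns an open file descriptor, or -1 with errno set.  If the space
 * cannot be reserved, the half-created object is unlinked again.
 */
static int
create_reserved_shm(const char *name, off_t size)
{
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    int err = 0;

    if (fd < 0)
        return -1;

    /* ftruncate only sets the length; it does not reserve any space. */
    if (ftruncate(fd, size) != 0)
        err = errno;
    else
    {
        /*
         * posix_fallocate returns an error number directly instead of
         * setting errno, so capture it and retry if interrupted.
         */
        do
            err = posix_fallocate(fd, 0, size);
        while (err == EINTR);
    }

    if (err != 0)
    {
        close(fd);
        shm_unlink(name);
        errno = err;
        return -1;
    }

    return fd;
}
```

A caller would check for -1 and report strerror(errno).  The
expectation, in line with the patched output above, is that an
oversized request fails here with "No space left on device" instead of
succeeding and raising SIGBUS later.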

--
Thomas Munro
http://www.enterprisedb.com

Attachment
