Thread: Server crash due to SIGBUS(Bus Error) when trying to access the memory created using dsm_create().

Hi All,

When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when later try to access the memory.  This happens because of my system does not have enough shm space &  current allocation in dsm_impl_posix does not allocate disk blocks[1].
 
I wonder can we use fallocate system call (i.e. Zero-fill the file) to ensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory mapping. But here we will endup by loop calling ‘write’ squillions of times.

Thoughts/Suggestions ?

Similar post: 
 [1] http://uk.comp.os.linux.narkive.com/Ve44sO4i/shared-memory-problem-no-space-at-dev-shm-causes-sigbus 


Regards,
Amul Sul
amul sul <sul_amul@yahoo.co.in> writes:
> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when later
tryto access the memory.  This happens because of my system does not have enough shm space &  current allocation in
dsm_impl_posixdoes not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file) to
ensurethat all the file space has really been allocated, so that we don't later seg fault when accessing the memory
mapping.But here we will endup by loop calling ‘write’ squillions of times.
 

Wouldn't that just result in a segfault during dsm_create?

I think probably what you are describing here is kernel misbehavior
akin to memory overcommit.  Maybe it *is* memory overcommit and can
be turned off the same way.  If not, you have material for a kernel
bug fix/enhancement request.
        regards, tom lane



No segfault during dsm_create,  mmap returns  the memory address  which is inaccessible. 

Let me see how can I disable kernel overcommit behaviour, but  IMHO,  we should prevent ourselves from crashing,  shouldn't we? 

Regards,  
Amul
---------------------------------------------------------------------------------------------------- 
Sent from a mobile device. Please excuse brevity and tpyos.

On Fri, 12 Aug, 2016 at 7:38 pm, Tom Lane
<tgl@sss.pgh.pa.us> wrote:
amul sul <sul_amul@yahoo.co.in> writes:

> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when later try to access the memory.  This happens because of my system does not have enough shm space &  current allocation in dsm_impl_posix does not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file) to ensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory mapping. But here we will endup by loop calling ‘write’ squillions of times.


Wouldn't that just result in a segfault during dsm_create?

I think probably what you are describing here is kernel misbehavior
akin to memory overcommit.  Maybe it *is* memory overcommit and can
be turned off the same way.  If not, you have material for a kernel
bug fix/enhancement request.

            regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Fri, Aug 12, 2016 at 1:55 PM, amul sul <sul_amul@yahoo.co.in> wrote:
> No segfault during dsm_create,  mmap returns  the memory address  which is
> inaccessible.
>
> Let me see how can I disable kernel overcommit behaviour, but  IMHO,  we
> should prevent ourselves from crashing,  shouldn't we?

Overcommit can be caught at dsm_create time with an mlock call (note,
that MAP_LOCKED isn't the same). Even a transient lock (mlock +
munlock) could prevent sigbus due to overcommit.



On Sat, Aug 13, 2016 at 2:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> amul sul <sul_amul@yahoo.co.in> writes:
>> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when
latertry to access the memory.  This happens because of my system does not have enough shm space &  current allocation
indsm_impl_posix does not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file)
toensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory
mapping.But here we will endup by loop calling ‘write’ squillions of times. 
>
> Wouldn't that just result in a segfault during dsm_create?
>
> I think probably what you are describing here is kernel misbehavior
> akin to memory overcommit.  Maybe it *is* memory overcommit and can
> be turned off the same way.  If not, you have material for a kernel
> bug fix/enhancement request.

I think this may be different from overcommit.

In dsm_impl_posix we do shm_open, then ftruncate.  That creates a file
with a hole.  Based on an LKML discussion where someone tried to
address this with a patch that was rejected[1], it believe that Linux
implements POSIX shmem as a tmpfs file and in this case the file has a
hole, which is not the same phenomenon as unallocated virtual memory
pages resulting from overcommit policy.

In dsm_impl_mmap it looks like we have code to deal with the same
problem:  we do open, then, ftruncate, and then we explicitly write a
bunch of zeros to the file, with this comment:
       /*        * Zero-fill the file. We have to do this the hard way to ensure that        * all the file space has
reallybeen allocated, so that we don't        * later seg fault when accessing the memory mapping.  This is pretty
 * pessimal.        */ 

Maybe we didn't do that for dsm_impl_posix because maybe you can't
write to a fd created with shm_open like that, I don't know.  But it
looks like if we used fallocate or posix_fallocate in the
dsm_impl_posix case we'd get a nice ESPC error, instead of
success-but-later-SIGBUS-on-access.  Whether there is *also* the
possibility of overcommit biting you later I don't know, but I suspect
that's an independent problem.  The OOM killer kills you with SIGKILL,
not SIGBUS.

[1] https://lkml.org/lkml/2013/7/31/64

--
Thomas Munro
http://www.enterprisedb.com



On Sat, Aug 13, 2016 at 8:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sat, Aug 13, 2016 at 2:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> amul sul <sul_amul@yahoo.co.in> writes:
>>> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when
latertry to access the memory.  This happens because of my system does not have enough shm space &  current allocation
indsm_impl_posix does not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file)
toensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory
mapping.But here we will endup by loop calling ‘write’ squillions of times. 
>>
>> Wouldn't that just result in a segfault during dsm_create?
>>
>> I think probably what you are describing here is kernel misbehavior
>> akin to memory overcommit.  Maybe it *is* memory overcommit and can
>> be turned off the same way.  If not, you have material for a kernel
>> bug fix/enhancement request.
>
> [...] But it
> looks like if we used fallocate or posix_fallocate in the
> dsm_impl_posix case we'd get a nice ESPC error, instead of
> success-but-later-SIGBUS-on-access.

Here's a simple test extension that creates jumbo dsm segments, and
then accesses all pages.  If you ask it to write cheques that your
Linux 3.10 machine can't cash on unpatched master, it does this:

postgres=# create extension foo;
CREATE EXTENSION
postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
server closed the connection unexpectedly
...
LOG:  server process (PID 15105) was terminated by signal 7: Bus error

If I apply the attached experimental patch I get:

postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
ERROR:  could not resize shared memory segment
"/PostgreSQL.1938734921" to 17179869184 bytes: No space left on device

It should probably be refactored a bit to separate the error messages
for ftruncate and posix_fallocate, and we could possibly use the same
approach for dsm_impl_mmap instead of that write() loop, but this at
least demonstrates the problem Amul reported.  Thoughts?

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Fri, Aug 12, 2016 at 9:22 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sat, Aug 13, 2016 at 8:26 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Sat, Aug 13, 2016 at 2:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> amul sul <sul_amul@yahoo.co.in> writes:
>>>> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when
latertry to access the memory.  This happens because of my system does not have enough shm space &  current allocation
indsm_impl_posix does not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file)
toensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory
mapping.But here we will endup by loop calling ‘write’ squillions of times. 
>>>
>>> Wouldn't that just result in a segfault during dsm_create?
>>>
>>> I think probably what you are describing here is kernel misbehavior
>>> akin to memory overcommit.  Maybe it *is* memory overcommit and can
>>> be turned off the same way.  If not, you have material for a kernel
>>> bug fix/enhancement request.
>>
>> [...] But it
>> looks like if we used fallocate or posix_fallocate in the
>> dsm_impl_posix case we'd get a nice ESPC error, instead of
>> success-but-later-SIGBUS-on-access.
>
> Here's a simple test extension that creates jumbo dsm segments, and
> then accesses all pages.  If you ask it to write cheques that your
> Linux 3.10 machine can't cash on unpatched master, it does this:
>
> postgres=# create extension foo;
> CREATE EXTENSION
> postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
> server closed the connection unexpectedly
> ...
> LOG:  server process (PID 15105) was terminated by signal 7: Bus error
>
> If I apply the attached experimental patch I get:
>
> postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
> ERROR:  could not resize shared memory segment
> "/PostgreSQL.1938734921" to 17179869184 bytes: No space left on device
>
> It should probably be refactored a bit to separate the error messages
> for ftruncate and posix_fallocate, and we could possibly use the same
> approach for dsm_impl_mmap instead of that write() loop, but this at
> least demonstrates the problem Amul reported.  Thoughts?

Seems like it could be a reasonable change.  I wonder what happens on
other platforms.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Aug 17, 2016 at 4:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 12, 2016 at 9:22 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Sat, Aug 13, 2016 at 8:26 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Sat, Aug 13, 2016 at 2:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> amul sul <sul_amul@yahoo.co.in> writes:
>>>>> When I am calling dsm_create on Linux using the POSIX DSM implementation can succeed, but result in SIGBUS when
latertry to access the memory.  This happens because of my system does not have enough shm space &  current allocation
indsm_impl_posix does not allocate disk blocks[1]. I wonder can we use fallocate system call (i.e. Zero-fill the file)
toensure that all the file space has really been allocated, so that we don't later seg fault when accessing the memory
mapping.But here we will endup by loop calling ‘write’ squillions of times. 
>>>>
>>>> Wouldn't that just result in a segfault during dsm_create?
>>>>
>>>> I think probably what you are describing here is kernel misbehavior
>>>> akin to memory overcommit.  Maybe it *is* memory overcommit and can
>>>> be turned off the same way.  If not, you have material for a kernel
>>>> bug fix/enhancement request.
>>>
>>> [...] But it
>>> looks like if we used fallocate or posix_fallocate in the
>>> dsm_impl_posix case we'd get a nice ESPC error, instead of
>>> success-but-later-SIGBUS-on-access.
>>
>> Here's a simple test extension that creates jumbo dsm segments, and
>> then accesses all pages.  If you ask it to write cheques that your
>> Linux 3.10 machine can't cash on unpatched master, it does this:
>>
>> postgres=# create extension foo;
>> CREATE EXTENSION
>> postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
>> server closed the connection unexpectedly
>> ...
>> LOG:  server process (PID 15105) was terminated by signal 7: Bus error
>>
>> If I apply the attached experimental patch I get:
>>
>> postgres=# select test_dsm(16::bigint * 1024 * 1024 * 1024);
>> ERROR:  could not resize shared memory segment
>> "/PostgreSQL.1938734921" to 17179869184 bytes: No space left on device
>>
>> It should probably be refactored a bit to separate the error messages
>> for ftruncate and posix_fallocate, and we could possibly use the same
>> approach for dsm_impl_mmap instead of that write() loop, but this at
>> least demonstrates the problem Amul reported.  Thoughts?
>
> Seems like it could be a reasonable change.  I wonder what happens on
> other platforms.

FreeBSD 10.3 returns successfully from shm_open and then displays
classic overcommit symptoms when you try to access the memory:

LOG:  server process (PID 22714) was terminated by signal 9: Killed
DETAIL:  Failed process was running: select test_dsm(16::bigint * 1024
* 1024 * 1024);
From OS logs: pid 22714 (postgres), uid 1001, was killed: out of swap space

Unfortunately it doesn't like posix_fallocate on a fd returned by
shm_open, and barfs with ENODEV.  So this would need to be a
Linux-only trick.

I still think it's worth thinking about something along these lines on
Linux only, where holey Swiss tmpfs files can bite you.  Otherwise
disabling overcommit on your OS isn't enough to prevent something
which is really a kind of deferred overcommit with a surprising
failure mode (SIGBUS rather than OOM SIGKILL).

--
Thomas Munro
http://www.enterprisedb.com



On Tue, Aug 16, 2016 at 7:41 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I still think it's worth thinking about something along these lines on
> Linux only, where holey Swiss tmpfs files can bite you.  Otherwise
> disabling overcommit on your OS isn't enough to prevent something
> which is really a kind of deferred overcommit with a surprising
> failure mode (SIGBUS rather than OOM SIGKILL).

Yeah, I am inclined to agree.  I mean, creating a DSM is fairly
heavyweight already, so one extra system call isn't (I hope) a crazy
overhead.  We could test to see how much it slows things down.  But it
may be worth paying the cost even if it ends up being kinda expensive.
We don't really have any way of knowing whether the caller's request
is reasonable relative to the amount of virtual memory available, and
converting a possible SIGBUS into an ereport(ERROR, ...) is a big win.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Aug 23, 2016 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Aug 16, 2016 at 7:41 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I still think it's worth thinking about something along these lines on
>> Linux only, where holey Swiss tmpfs files can bite you.  Otherwise
>> disabling overcommit on your OS isn't enough to prevent something
>> which is really a kind of deferred overcommit with a surprising
>> failure mode (SIGBUS rather than OOM SIGKILL).
>
> Yeah, I am inclined to agree.  I mean, creating a DSM is fairly
> heavyweight already, so one extra system call isn't (I hope) a crazy
> overhead.  We could test to see how much it slows things down.  But it
> may be worth paying the cost even if it ends up being kinda expensive.
> We don't really have any way of knowing whether the caller's request
> is reasonable relative to the amount of virtual memory available, and
> converting a possible SIGBUS into an ereport(ERROR, ...) is a big win.

Here's a version of the patch that only does something special if the
following planets are aligned:

* Linux only: for now, there doesn't seem to be any reason to assume
that other operating systems share this file-with-holes implementation
quirk, or that posix_fallocate would work on such a fd, or which errno
values to tolerate if it doesn't.  From what I can tell, Solaris,
FreeBSD etc either don't overcommit or do normal non-stealth
overcommit with the usual out-of-swap failure mode for shm_open
memory, with a way to turn overcommit off.  So I put a preprocessor
test in to do this just for __linux__, and I used "fallocate" (a
non-standard Linux syscall) instead of "posix_fallocate".

* Glibc version >= 2.10: ancient versions and other libc
implementations don't have fallocate, so I put a test into the
configure script.

* Kernel version >= 2.6.23+: the man page says that ancient kernels
don't provide the syscall, and that glibc sets errno to ENOSYS in that
case, so I put a check in to keep calm and carry on.

I don't know if any distros ever shipped with an old enough kernel and
new enough glibc for ENOSYS to happen in the wild; for example RHEL5
had neither kernel nor glibc support, and RHEL6 had both.  I haven't
personally tested that path.

Maybe it would be worth thinking about whether this is a condition
that should cause dsm_create to return NULL rather than ereporting,
depending on a flag along the lines of the existing
DSM_CREATE_NULL_IF_MAXSEGMENTS.  But that could be a separate patch if
it turns out to be useful.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Tue, Aug 23, 2016 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> We could test to see how much it slows things down.  But it
> may be worth paying the cost even if it ends up being kinda expensive.

Here are some numbers from a Xeon E7-8830 @ 2.13GHz running Linux 3.10
running the attached program.  It's fairly noisy and I didn't run
super long tests with many repeats, but the general theme is visible.
If you're actually going to USE the memory, it's only a small extra
cost to have reserved seats.  But if there's a strong chance you'll
never access most of the memory, you might call it expensive.

Segment size 1MB:

base = shm_open + ftruncate + mmap + munmap + close = 5us
base + fallocate = 38us
base + memset = 332us
base + fallocate + memset = 346us

Segment size 1GB:

base = shm_open + ftruncate + mmap + munmap + close = 10032us
base + fallocate = 30774us
base + memset = 602925us
base + fallocate + memset = 655433us

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Mon, Aug 22, 2016 at 8:18 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Aug 23, 2016 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> We could test to see how much it slows things down.  But it
>> may be worth paying the cost even if it ends up being kinda expensive.
>
> Here are some numbers from a Xeon E7-8830 @ 2.13GHz running Linux 3.10
> running the attached program.  It's fairly noisy and I didn't run
> super long tests with many repeats, but the general theme is visible.
> If you're actually going to USE the memory, it's only a small extra
> cost to have reserved seats.  But if there's a strong chance you'll
> never access most of the memory, you might call it expensive.
>
> Segment size 1MB:
>
> base = shm_open + ftruncate + mmap + munmap + close = 5us
> base + fallocate = 38us
> base + memset = 332us
> base + fallocate + memset = 346us
>
> Segment size 1GB:
>
> base = shm_open + ftruncate + mmap + munmap + close = 10032us
> base + fallocate = 30774us
> base + memset = 602925us
> base + fallocate + memset = 655433us

Typical DSM segments for parallel query seem to be much smaller than
1MB.  I just added an elog(NOTICE, ...) to dsm_create to print the
size and ran the regression tests.  I got these results:

+ NOTICE:  dsm_create: 89352
+ NOTICE:  dsm_create: 332664
+ NOTICE:  dsm_create: 86664

So for parallel query we're looking at a hit that is probably in the
range of one-tenth of one millisecond or less, which sees like it's
not really a big deal considering that the typical startup time is 4ms
and, really, at this point, we're aiming to use this primarily for
queries with runtimes in the hundreds of milliseconds and more.  Also,
the code can be arbitrarily fast if it doesn't have to be safe.

Now, for bigger segment sizes, I think there actually could be a
little bit of a noticeable performance hit here, because it's not just
about total elapsed time.  Even if the code eventually touches all of
the memory, it might not touch it all before starting to fire up
workers or whatever else it wants to do with the DSM segment.  But I'm
thinking we still need to bite the bullet and pay the expense, because
crash-and-restart cycles are *really* bad.

Assuming the DSA code you submitted gets committed, that's really
where the hit will be here: you'll be be merrily allocating chunks of
dynamic shared memory until your existing DSM segment fills up, and
then, kaboom, you'll go into the tank for half a second when you try
to do the next allocation, supposing the next segment is 1GB in size.
That's not much fun, especially considering that .  But again, unless
we have a faster way to force the system to allocate the pages, I
think we're just going to have to live with that.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Aug 24, 2016 at 2:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Now, for bigger segment sizes, I think there actually could be a
> little bit of a noticeable performance hit here, because it's not just
> about total elapsed time.  Even if the code eventually touches all of
> the memory, it might not touch it all before starting to fire up
> workers or whatever else it wants to do with the DSM segment.  But I'm
> thinking we still need to bite the bullet and pay the expense, because
> crash-and-restart cycles are *really* bad.

Here is a new rebased version of this patch, primarily to poke this
thread as an unresolved question.  This patch is not committable as is
though: I discovered that parallel query can cause fallocate to return
with errno == EINTR.  I haven't yet investigated whether fallocate is
supposed to be restartable, or signals should be blocked, or something
else is wrong.  Another question is whether the call to ftruncate() is
actually necessary before the call to fallocate().

Unfounded speculation: fallocate() might actually *improve*
performance of DSM segments if your access pattern involves random
access (just to pick an example out of the air, something like...
building a hash table), since it's surely easier to allocate a big
contiguous chunk than a squillion random pages most of which divide an
existing hole into two smaller holes...

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Wed, Jun 28, 2017 at 5:19 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Aug 24, 2016 at 2:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Now, for bigger segment sizes, I think there actually could be a
>> little bit of a noticeable performance hit here, because it's not just
>> about total elapsed time.  Even if the code eventually touches all of
>> the memory, it might not touch it all before starting to fire up
>> workers or whatever else it wants to do with the DSM segment.  But I'm
>> thinking we still need to bite the bullet and pay the expense, because
>> crash-and-restart cycles are *really* bad.
>
> Here is a new rebased version of this patch, primarily to poke this
> thread as an unresolved question.  This patch is not committable as is
> though: I discovered that parallel query can cause fallocate to return
> with errno == EINTR.  I haven't yet investigated whether fallocate is
> supposed to be restartable, or signals should be blocked, or something
> else is wrong.  Another question is whether the call to ftruncate() is
> actually necessary before the call to fallocate().

I think this line is saying that it won't restart automatically:

https://github.com/torvalds/linux/blob/590dce2d4934fb909b112cd80c80486362337744/mm/shmem.c#L2884

Compare this patch (not in the kernel tree) that suggests that line
should be changed to cause restart:

https://lkml.org/lkml/2016/3/3/987

- error = -EINTR;
+ error = -ERESTARTSYS;

So I think we either need to mask signals with or put in an explicit
retry loop, as shown in the attached version of the patch.  With the
v3 patch I posted earlier, I see interrupted system call failures in
the select_parallel regression test, but with the v4 it passes.
Thoughts?

> Unfounded speculation: fallocate() might actually *improve*
> performance of DSM segments if your access pattern involves random
> access (just to pick an example out of the air, something like...
> building a hash table), since it's surely easier to allocate a big
> contiguous chunk than a squillion random pages most of which divide an
> existing hole into two smaller holes...

Bleugh... I retract this, of course we initialise the hash table in
order anyway so this doesn't make any sense.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On 2017-06-28 19:07:50 +1200, Thomas Munro wrote:
> I think this line is saying that it won't restart automatically:
> 
> https://github.com/torvalds/linux/blob/590dce2d4934fb909b112cd80c80486362337744/mm/shmem.c#L2884

Indeed.


> So I think we either need to mask signals with or put in an explicit
> retry loop, as shown in the attached version of the patch.  With the
> v3 patch I posted earlier, I see interrupted system call failures in
> the select_parallel regression test, but with the v4 it passes.
> Thoughts?

I think a retry loop is a lot better than blocking signals.


> > Unfounded speculation: fallocate() might actually *improve*
> > performance of DSM segments if your access pattern involves random
> > access (just to pick an example out of the air, something like...
> > building a hash table), since it's surely easier to allocate a big
> > contiguous chunk than a squillion random pages most of which divide an
> > existing hole into two smaller holes...
> 
> Bleugh... I retract this, of course we initialise the hash table in
> order anyway so this doesn't make any sense.

It's still faster to create larger mappings at once, rather than through
subsequent page faults...


> diff --git a/configure.in b/configure.in
> index 11eb9c8acfc..47452bbac43 100644
> --- a/configure.in
> +++ b/configure.in
> @@ -1429,7 +1429,7 @@ PGAC_FUNC_WCSTOMBS_L
>  LIBS_including_readline="$LIBS"
>  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
>  
> -AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat
pthread_is_threaded_npreadlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs
wcstombs_l])
> +AC_CHECK_FUNCS([cbrt clock_gettime dlopen fallocate fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove
pollpstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes
wcstombswcstombs_l])
 

Why are we going for fallocate rather than posix_fallocate()? There's
plenty use cases that can only be done with the former, but this doesn't
seem like one of them?

Currently this is a linux specific path - but I don't actually see any
reason to keep it that way? It seems far from unlikely that this is just
an issue with linux...

>
>  #ifdef USE_DSM_POSIX
>  /*
> + * Set the size of a virtual memory region associate with a file descriptor.
> + * On Linux, also ensure that virtual memory is actually allocated by the
> + * operating system to avoid nasty surprises later.
> + *
> + * Returns non-zero if either truncation or allocation fails, and sets errno.
> + */
> +static int
> +dsm_impl_posix_resize(int fd, off_t size)
> +{
> +        int rc;
> +
> +        /* Truncate (or extend) the file to the requested size. */
> +        rc = ftruncate(fd, size);
> +
> +#ifdef HAVE_FALLOCATE
> +#ifdef __linux__
> +        /*
> +         * On Linux, a shm_open fd is backed by a tmpfs file.  After resizing
> +         * with ftruncate it may contain a hole.  Accessing memory backed by a
> +         * hole causes tmpfs to allocate pages, which fails with SIGBUS if
> +         * there is no virtual memory available.  So we ask tmpfs to allocate
> +         * pages here, so we can fail gracefully with ENOSPC now rather than
> +         * risking SIGBUS later.
> +         */
> +        if (rc == 0)
> +        {
> +            do
> +            {
> +                rc = fallocate(fd, 0, 0, size);
> +            } while (rc == -1 && errno == EINTR);
> +            if (rc != 0 && errno == ENOSYS)
> +            {
> +                /* Kernel too old (< 2.6.23). */
> +                rc = 0;
> +            }
> +        }
> +#endif
> +#endif

I'd rather fall-back to just write() initializing the buffer, and do
either of those in all implementations rather than just linux.

> diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
> index 7a05c7e5b85..dcb9a846c7b 100644
> --- a/src/include/pg_config.h.in
> +++ b/src/include/pg_config.h.in
> @@ -167,6 +167,9 @@
>  /* Define to 1 if you have the <editline/readline.h> header file. */
>  #undef HAVE_EDITLINE_READLINE_H
>  
> +/* Define to 1 if you have the `fallocate' function. */
> +#undef HAVE_FALLOCATE
> +
>  /* Define to 1 if you have the `fdatasync' function. */
>  #undef HAVE_FDATASYNC

Needs pg_config.h.win32 adjustment...

Greetings,

Andres Freund



On Thu, Jun 29, 2017 at 11:04 AM, Andres Freund <andres@anarazel.de> wrote:
>> diff --git a/configure.in b/configure.in
>> index 11eb9c8acfc..47452bbac43 100644
>> --- a/configure.in
>> +++ b/configure.in
>> @@ -1429,7 +1429,7 @@ PGAC_FUNC_WCSTOMBS_L
>>  LIBS_including_readline="$LIBS"
>>  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
>>
>> -AC_CHECK_FUNCS([cbrt clock_gettime dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat
pthread_is_threaded_npreadlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes wcstombs
wcstombs_l])
>> +AC_CHECK_FUNCS([cbrt clock_gettime dlopen fallocate fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove
pollpstat pthread_is_threaded_np readlink setproctitle setsid shm_open symlink sync_file_range towlower utime utimes
wcstombswcstombs_l]) 
>
> Why are we going for fallocate rather than posix_fallocate()? There's
> plenty use cases that can only be done with the former, but this doesn't
> seem like one of them?
>
> Currently this is a linux specific path - but I don't actually see any
> reason to keep it that way? It seems far from unlikely that this is just
> an issue with linux...

POSIX says that posix_fallocate() only works on files.  It's an
implementation detail that shm_open() returns an fd for a tmpfs file
on Linux.  posix_fallocate() on an fd returned by shm_open() fails
under FreeBSD, and I suspect on other Unices too (but haven't tried),
because there is no requirement for POSIX shm_open() objects to be so
file-ish.  My take is that this is a Linux-specific problem requiring
a Linux-specific solution.    Considering the above and the advice on
LKML from a senior kernel hacker[1] to use fallocate() to solve this
problem, I figured that was the right way to go.  But yeah, it also
seems to work with posix_fallocate() *on Linux*.

>>
>>  #ifdef USE_DSM_POSIX
>>  /*
>> + * Set the size of a virtual memory region associate with a file descriptor.
>> + * On Linux, also ensure that virtual memory is actually allocated by the
>> + * operating system to avoid nasty surprises later.
>> + *
>> + * Returns non-zero if either truncation or allocation fails, and sets errno.
>> + */
>> +static int
>> +dsm_impl_posix_resize(int fd, off_t size)
>> +{
>> +             int rc;
>> +
>> +             /* Truncate (or extend) the file to the requested size. */
>> +             rc = ftruncate(fd, size);
>> +
>> +#ifdef HAVE_FALLOCATE
>> +#ifdef __linux__
>> +             /*
>> +              * On Linux, a shm_open fd is backed by a tmpfs file.  After resizing
>> +              * with ftruncate it may contain a hole.  Accessing memory backed by a
>> +              * hole causes tmpfs to allocate pages, which fails with SIGBUS if
>> +              * there is no virtual memory available.  So we ask tmpfs to allocate
>> +              * pages here, so we can fail gracefully with ENOSPC now rather than
>> +              * risking SIGBUS later.
>> +              */
>> +             if (rc == 0)
>> +             {
>> +                     do
>> +                     {
>> +                             rc = fallocate(fd, 0, 0, size);
>> +                     } while (rc == -1 && errno == EINTR);
>> +                     if (rc != 0 && errno == ENOSYS)
>> +                     {
>> +                             /* Kernel too old (< 2.6.23). */
>> +                             rc = 0;
>> +                     }
>> +             }
>> +#endif
>> +#endif
>
> I'd rather fall-back to just write() initializing the buffer, and do
> either of those in all implementations rather than just linux.

You can't write() to a shm_open() fd on FreeBSD.  I can't find wording
in POSIX either way but it's not portable, and it's not too surprising
if you consider the spirit of POSIX shm_open() to be something like
"you might want to implement this as a special kind of file, or not"
but with the intention being to mmap it, not read/write it.

Note that I'm specifically addressing the freaky Linux SIGBUS problem
here: not virtual memory overcommit, which may or may not happen on
your particular OS leading to failures independent of this, and for
example does happen with FreeBSD.  The way to avoid that on FreeBSD is
probably to disable overcommit (I haven't checked).  The Linux SIGBUS
problem is not regular overcommit, it's tmpfs file holes and is a
different failure mode and AFAIK can bite you even if you disable
overcommit on Linux like TFM tells you to.

>> diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
>> index 7a05c7e5b85..dcb9a846c7b 100644
>> --- a/src/include/pg_config.h.in
>> +++ b/src/include/pg_config.h.in
>> @@ -167,6 +167,9 @@
>>  /* Define to 1 if you have the <editline/readline.h> header file. */
>>  #undef HAVE_EDITLINE_READLINE_H
>>
>> +/* Define to 1 if you have the `fallocate' function. */
>> +#undef HAVE_FALLOCATE
>> +
>>  /* Define to 1 if you have the `fdatasync' function. */
>>  #undef HAVE_FDATASYNC
>
> Needs pg_config.h.win32 adjustment...

Oops, didn't notice that we maintain that thing by hand.  New version attached.

[1] https://lkml.org/lkml/2013/7/31/64

--
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Thu, Jun 29, 2017 at 12:24 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> fallocate-v5.patch

Added to commitfest so we don't lose track of this.

I'm mainly concerned about the fact that we have a way for PostgreSQL
to die that looks exactly like a bug, when really it's masking an
out-of-memory condition that a DBA or sysadmin would normally be able
to diagnose by the appearance of the OOM reaper at the window holding
a scythe.  I suppose the SIGBUS case is much more likely to happen
when we start actively using large amounts of dynamic shared memory
(parallel hash), but I suppose it could happen now if your system is
already overcommitted and a small DSM segment happens to push you over
the edge.

-- 
Thomas Munro
http://www.enterprisedb.com



On Thu, Aug 17, 2017 at 11:39 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Jun 29, 2017 at 12:24 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> fallocate-v5.patch
>
> Added to commitfest so we don't lose track of this.

Rebased due to collision with recent configure.in adjustments.  I also
wrote a commit message and retested with create-dsm-test.patch (from
upthread).

So, do we want this patch?  It's annoying to expend cycles doing this,
but it only really hurts if you allocate a lot of DSM space that you
never actually use.  If that ever becomes a serious problem, perhaps
that'll be a sign that we should be reusing the space between queries
anyway?

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> So, do we want this patch?

I think we don't really have a lot of choice.  I propose applying this
as far back as 9.6 --- anyone think differently?
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Mon, Sep 25, 2017 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> So, do we want this patch?
>
> I think we don't really have a lot of choice.  I propose applying this
> as far back as 9.6 --- anyone think differently?

+1.  If applies to 9.5 and 9.4 without a lot of work, I think we
should apply it there as well, in case there is out-of-core use of DSM
(or, in 9.5, also of ParallelContext).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 25, 2017 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>>> So, do we want this patch?

>> I think we don't really have a lot of choice.  I propose applying this
>> as far back as 9.6 --- anyone think differently?

> +1.  If applies to 9.5 and 9.4 without a lot of work, I think we
> should apply it there as well, in case there is out-of-core use of DSM
> (or, in 9.5, also of ParallelContext).

I'll take a look, but I doubt it's worth much extra effort (and you
seem to agree).
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 25, 2017 at 10:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think we don't really have a lot of choice.  I propose applying this
>> as far back as 9.6 --- anyone think differently?

> +1.  If applies to 9.5 and 9.4 without a lot of work, I think we
> should apply it there as well, in case there is out-of-core use of DSM
> (or, in 9.5, also of ParallelContext).

Hmm, so I tested this patch on my RHEL6 box (kernel 2.6.32) and it
immediately fell over with

2017-09-25 14:23:48.410 EDT [325] FATAL:  could not resize shared memory segment "/PostgreSQL.1682054886" to 6928
bytes:Operation not supported 

during startup.  I wonder whether we need to round off the request.
        regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

I wrote:
> Hmm, so I tested this patch on my RHEL6 box (kernel 2.6.32) and it
> immediately fell over with
> 2017-09-25 14:23:48.410 EDT [325] FATAL:  could not resize shared memory segment "/PostgreSQL.1682054886" to 6928
bytes:Operation not supported 
> during startup.  I wonder whether we need to round off the request.

Nope, rounding off doesn't help.  What does help is using posix_fallocate
instead.  I surmise that glibc knows something we don't about how to call
fallocate(2) successfully on this kernel version.

Rather than dig into the guts of glibc to find that out, though, I think
we should just s/fallocate/posix_fallocate/g on this patch.  The argument
for using the former seemed pretty thin to begin with.
        regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

I wrote:
> Rather than dig into the guts of glibc to find that out, though, I think
> we should just s/fallocate/posix_fallocate/g on this patch.  The argument
> for using the former seemed pretty thin to begin with.

Pushed with that change; we'll soon see what the buildfarm thinks.

I suspect that the provisions for EINTR and ENOSYS errors may now be
dead code, since the Linux man page for posix_fallocate mentions
neither.  They're not much code though, and POSIX itself *does*
list EINTR, so I'm hesitant to muck with that.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Sep 26, 2017 at 7:56 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Hmm, so I tested this patch on my RHEL6 box (kernel 2.6.32) and it
>> immediately fell over with
>> 2017-09-25 14:23:48.410 EDT [325] FATAL:  could not resize shared memory segment "/PostgreSQL.1682054886" to 6928
bytes:Operation not supported
 
>> during startup.  I wonder whether we need to round off the request.
>
> Nope, rounding off doesn't help.  What does help is using posix_fallocate
> instead.  I surmise that glibc knows something we don't about how to call
> fallocate(2) successfully on this kernel version.
>
> Rather than dig into the guts of glibc to find that out, though, I think
> we should just s/fallocate/posix_fallocate/g on this patch.  The argument
> for using the former seemed pretty thin to begin with.

I didn't dig into the glibc source but my first guess is that
posix_fallocate() sees ENOTSUPP (from tmpfs on that vintage kernel)
and falls back to a write loop.  I thought I was doing enough by
testing for ENOSYS (based on the man page for fallocate which said
that should be expected on kernels < 2.6.23), but I see now that
ENOTSUPP is possible per-filesystem implementation and tmpfs support
was added some time after the 2.6.32 era:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e2d12e22c59ce714008aa5266d769f8568d74eac

I'm not sure why glibc would provide a fallback for posix_fallocate()
but let ENOTSUPP escape from fallocate().

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Sep 26, 2017 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Rather than dig into the guts of glibc to find that out, though, I think
>> we should just s/fallocate/posix_fallocate/g on this patch.  The argument
>> for using the former seemed pretty thin to begin with.
>
> Pushed with that change; we'll soon see what the buildfarm thinks.

Thanks.

> I suspect that the provisions for EINTR and ENOSYS errors may now be
> dead code, since the Linux man page for posix_fallocate mentions
> neither.  They're not much code though, and POSIX itself *does*
> list EINTR, so I'm hesitant to muck with that.

Ah, it all makes sense now that I see the fallback strategy section of
the posix_fallocate() man page.  I was unaware that there were kernel
releases that had the syscall but lacked support in tmpfs.  Thanks for
testing and fixing that.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

> On Tue, Sep 26, 2017 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Pushed with that change; we'll soon see what the buildfarm thinks.

Hmm.  One failure in the test modules:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2017-09-25%2020%3A45%3A02

2017-09-25 13:51:00.753 PDT [9107:6] LOG:  worker process: test_shm_mq
(PID 9362) exited with exit code 1
2017-09-25 13:51:00.753 PDT [9360:7] ERROR:  could not resize shared
memory segment "/PostgreSQL.1896677221" to 65824 bytes: Success

hostname = centos7x.joeconway.com
uname -r = 3.10.0-229.11.1.el7.x86_64

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Sep 26, 2017 at 9:57 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> On Tue, Sep 26, 2017 at 9:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Pushed with that change; we'll soon see what the buildfarm thinks.
>
> Hmm.  One failure in the test modules:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rhinoceros&dt=2017-09-25%2020%3A45%3A02
>
> 2017-09-25 13:51:00.753 PDT [9107:6] LOG:  worker process: test_shm_mq
> (PID 9362) exited with exit code 1
> 2017-09-25 13:51:00.753 PDT [9360:7] ERROR:  could not resize shared
> memory segment "/PostgreSQL.1896677221" to 65824 bytes: Success
>
> hostname = centos7x.joeconway.com
> uname -r = 3.10.0-229.11.1.el7.x86_64

I think the problem here is that posix_fallocate() doesn't set errno.
It returns an errno-like value.  This is a difference from
fallocate().  Will write a patch.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> I think the problem here is that posix_fallocate() doesn't set errno.

Huh.  So the fact that it worked for me is likely because glibc's
emulation *does* allow errno to get set.

> Will write a patch.

Thanks, I'm out of time for today.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

On Tue, Sep 26, 2017 at 10:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> I think the problem here is that posix_fallocate() doesn't set errno.
>
> Huh.  So the fact that it worked for me is likely because glibc's
> emulation *does* allow errno to get set.
>
>> Will write a patch.
>
> Thanks, I'm out of time for today.

See attached, which also removes the ENOSYS stuff which I believe to
be now useless.  Does this make sense?  Survives make check-world and
my simple test procedure on a
3.10.0-327.36.1.el7.x86_64 system.

postgres=# select test_dsm(1000000000000);
ERROR:  could not resize shared memory segment
"/PostgreSQL.2043796572" to 1000000000000 bytes: No space left on
device
postgres=# select test_dsm(1000);
 test_dsm
----------

(1 row)

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> See attached, which also removes the ENOSYS stuff which I believe to
> be now useless.  Does this make sense?  Survives make check-world and
> my simple test procedure on a 3.10.0-327.36.1.el7.x86_64 system.

Thanks.  Works on my RHEL6 box too, so pushed.

This certainly explains the "could not resize ...: Success" oddity
we see in the buildfarm.  It's not so clear why those critters are
failing in the first place.  I hope that what is happening is that
they are getting EINTR results and failing to retry.  (Although
my RHEL6 man page doesn't mention EINTR as a possible failure,
I see that's been corrected in later editions.)  If not, we should
at least get a better fix on the nature of the failure with this
patch.  Awaiting buildfarm results ...
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers