Re: dynamic shared memory - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: dynamic shared memory
Date
Msg-id CAA4eK1LQzophhQyEXt7WFjaPv40oOVMbiA_+7X1=PZvafpGebA@mail.gmail.com
In response to Re: dynamic shared memory  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Fri, Aug 30, 2013 at 9:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
>> > That way any corruption in that area will prevent restarts without
>> > reboot unless you use ipcrm, or such, right?
>>
>> The way I've designed it, no.  If what we expect to be the control
>> segment doesn't exist or doesn't conform to our expectations, we just
>> assume that it's not really the control segment after all - e.g.
>> someone rebooted, clearing all the segments, and then an unrelated
>> process (malicious, perhaps, or just a completely different cluster)
>> reused the same name.  This is similar to what we do for the main
>> shared memory segment.
>
> The case I am mostly wondering about is some process crashing and
> overwriting random memory. We need to be pretty sure that we'll never
> fail partially through cleaning up old segments because they are
> corrupted or because we died halfway through our last cleanup attempt.
>
>> > I think we want that during development, but I'd rather not go there
>> > when releasing. After all, we don't support a manual choice between
>> > anonymous mmap/sysv shmem either.
>
>> That's true, but that decision has not been uncontroversial - e.g. the
>> NetBSD guys don't like it, because they have a big performance
>> difference between those two types of memory.  We have to balance the
>> possible harm of one more setting against the benefit of letting
>> people do what they want without needing to recompile or modify code.
>
> But then, it made them fix the issue afaik :P
>
>> >> In addition, I've included an implementation based on mmap of a plain
>> >> file.  As compared with a true shared memory implementation, this
>> >> obviously has the disadvantage that the OS may be more likely to
>> >> decide to write back dirty pages to disk, which could hurt
>> >> performance.  However, I believe it's worthy of inclusion all the
>> >> same, because there are a variety of situations in which it might be
>> >> more convenient than one of the other implementations.  One is
>> >> debugging.
>> >
>> > Hm. Not sure what's the advantage over a corefile here.
>
>> You can look at it while the server's running.
>
> That's what debuggers are for.
>
>> >> On MacOS X, for example, there seems to be no way to list
>> >> POSIX shared memory segments, and no easy way to inspect the contents
>> >> of either POSIX or System V shared memory segments.
>
>> > Shouldn't we ourselves know which segments are around?
>
>> Sure, that's the point of the control segment.  But listing a
>> directory is a lot easier than figuring out what the current control
>> segment contents are.
>
> But without a good amount of tooling - like in a debugger... - it's not
> very interesting to look at those files either way? The mere presence of
> a segment doesn't tell you much and the contents won't be easily
> readable.
>
>> >> Another use case is working around an administrator-imposed or
>> >> OS-imposed shared memory limit.  If you're not allowed to allocate
>> >> shared memory, but you are allowed to create files, then this
>> >> implementation will let you use whatever facilities we build on top
>> >> of dynamic shared memory anyway.
>> >
>> > I don't think we should try to work around limits like that.
>
>> I do.  There's probably someone, somewhere in the world who thinks
>> that operating system shared memory limits are a good idea, but I have
>> not met any such person.
>
> "Let's drive users away from sysv shmem" is the only one I heard so far ;)
>
>> I would never advocate deliberately trying to circumvent a
>> carefully-considered OS-level policy decision about resource
>> utilization, but I don't think that's the dynamic here.  I think if we
>> insist on predetermining the dynamic shared memory implementation
>> based on the OS, we'll just be inconveniencing people needlessly, or
>> flat-out making things not work. [...]
>
> But using file-backed memory will *suck* performancewise. Why should we
> ever want to offer that to a user? That's what I was arguing about
> primarily.
>
>> If we're SURE
>> that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
>> 100% of cases, and that a NetBSD user will always prefer "sysv" over
>> "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
>> But I'm not that sure.
>
> I think posix shmem will be preferred to sysv shmem if present, in just
> about any relevant case. I don't know of any system with lower limits on
> posix shmem than on sysv.
>
>> I think this case is roughly similar
>> to wal_sync_method: there really shouldn't be a performance or
>> reliability difference between the ~6 ways of flushing a file to disk,
>> but as it turns out, there is, so we have an option.
>
> Well, most of them actually give different guarantees, so it makes sense
> to have differing performance...
>
>> > Why do we want to expose something unreliable as preferred_address to
>> > the external interface? I haven't read the code yet, so I might be
>> > missing something here.
>
>> I shared your opinion that preferred_address is never going to be
>> reliable, although FWIW Noah thinks it can be made reliable with a
>> large-enough hammer.
>
> I think we need to have the arguments for that on list then. Those are
> pretty damn fundamental design decisions.
> I for one cannot see how you even remotely could make that work a) on
> windows (check the troubles we have to go through to get s_b
> consistently placed, and that's directly after startup) b) 32bit systems.
For Windows, I believe we already do something similar (attaching at a
predefined address) for the main shared memory segment. We reserve memory
at a particular address using pgwin32_ReserveSharedMemoryRegion() before
actually starting the process (that is, before resuming the process that
was created in suspended mode), and once it has started, the backend
attaches at the same address (PGSharedMemoryReAttach).

I think one question here is what the use of exposing preferred_address
is. I can think of only the following:

a. The base OS APIs provide such a provision, so why shouldn't we?
b. While browsing, I found a few examples on the IBM site that also show
   usage with a preferred address:
   http://publib.boulder.ibm.com/infocenter/comphelp/v7v91/index.jsp?topic=%2Fcom.ibm.vacpp7a.doc%2Fproguide%2Fref%2Fcreate_heap.htm
c. A user may wish to attach segments at the same base address so that
   pointers stored in the memory-mapped file can be used, which would
   otherwise not be possible.

>> But even if it isn't reliable, there doesn't seem to be all that much
>> value in forbidding access to that part of the OS-provided API.  In
>> the world where it's not reliable, it may still be convenient to map
>> things at the same address when you can, so that pointers can be
>> used.  Of course you'd have to have some fallback strategy for when
>> you don't get the same mapping, and maybe that's painful enough that
>> there's no point after all.  Or maybe it's worth having one code path
>> for relativized pointers and another for non-relativized pointers.
>
> It seems likely to me that will end up with untested code in that
> case. Or even unsupported platforms.
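
For reference, the "relativized pointer" alternative mentioned above can
be sketched roughly as follows. This is only an illustration under my own
assumptions, not an API from the patch: relptr, relptr_from and relptr_to
are hypothetical names. The idea is to store offsets from the segment
base instead of absolute addresses, so the data stays valid even when
each backend maps the segment at a different address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical relative pointer: a byte offset from the segment base.
 * Offset 0 is reserved to mean NULL, so the first byte of the segment
 * cannot itself be a pointer target in this sketch.
 */
typedef uint64_t relptr;

/* Convert an absolute pointer inside the segment into an offset. */
static relptr
relptr_from(const void *base, const void *p)
{
    if (p == NULL)
        return 0;
    assert((const char *) p > (const char *) base);
    return (relptr) ((const char *) p - (const char *) base);
}

/* Convert an offset back into a pointer valid for this process's mapping. */
static void *
relptr_to(void *base, relptr off)
{
    return off == 0 ? NULL : (void *) ((char *) base + off);
}
```

Every dereference then goes through relptr_to() with the local base
address, which is the extra code path that would need testing on each
platform.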
>
>> To be honest, I'm not real sure.  I think it's clear enough that this
>> will meet the minimal requirements for parallel query - ONE dynamic
>> shared memory segment that's not guaranteed to be at the same address
>> in every backend, and can't be resized after creation.  And we could
>> pare the API down to only support that.  But I'd rather get some
>> experience with this first before we start taking away options.
>> Otherwise, we may never really find out the limits of what is possible
>> in this area, and I think that would be a shame.
>
> On the other hand, adding capabilities annoys people far less than
> deciding that we can't support them in the end and taking them away.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


