Re: dynamic shared memory - Mailing list pgsql-hackers

From Robert Haas
Subject Re: dynamic shared memory
Msg-id CA+Tgmob8vX+zCoxnif-SXzpZUVfQpcMBec6oV4Pg7p+VUHK+tw@mail.gmail.com
In response to Re: dynamic shared memory  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: dynamic shared memory  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Tue, Aug 27, 2013 at 10:07 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> [just sending an email which sat in my outbox for two weeks]

Thanks for taking a look.

> Nice to see this coming. I think it will actually be interesting for
> quite some things outside parallel query, but we'll see.

Yeah, I hope so.  The applications may be somewhat limited by the fact
that there are apparently fairly small limits to how many shared
memory segments you can map at the same time.  I believe on one system
I looked at (some version of HP-UX?) the limit was 11.  So we won't be
able to go nuts with this: using it definitely introduces all kinds of
failure modes that we don't have today.  But it will also let us do
some pretty cool things that we CAN'T do today.

>> To help solve these problems, I invented something called the "dynamic
>> shared memory control segment".  This is a dynamic shared memory
>> segment created at startup (or reinitialization) time by the
postmaster before any user processes are created.  It is used to store a
>> list of the identities of all the other dynamic shared memory segments
>> we have outstanding and the reference count of each.  If the
>> postmaster goes through a crash-and-reset cycle, it scans the control
>> segment and removes all the other segments mentioned there, and then
>> recreates the control segment itself.  If the postmaster is killed off
>> (e.g. kill -9) and restarted, it locates the old control segment and
>> proceeds similarly.
>
> That way any corruption in that area will prevent restarts without
> reboot unless you use ipcrm, or such, right?

The way I've designed it, no.  If what we expect to be the control
segment doesn't exist or doesn't conform to our expectations, we just
assume that it's not really the control segment after all - e.g.
someone rebooted, clearing all the segments, and then an unrelated
process (malicious, perhaps, or just a completely different cluster)
reused the same name.  This is similar to what we do for the main
shared memory segment.

>> Creating a shared memory segment is a somewhat operating-system
>> dependent task.  I decided that it would be smart to support several
>> different implementations and to let the user choose which one they'd
>> like to use via a new GUC, dynamic_shared_memory_type.
>
> I think we want that during development, but I'd rather not go there
> when releasing. After all, we don't support a manual choice between
> anonymous mmap/sysv shmem either.

That's true, but that decision has not been uncontroversial - e.g. the
NetBSD guys don't like it, because they have a big performance
difference between those two types of memory.  We have to balance the
possible harm of one more setting against the benefit of letting
people do what they want without needing to recompile or modify code.
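Concretely, the GUC being discussed would be set like any other postgresql.conf parameter; the value names below are the ones mentioned later in this message, and the choice shown is just an example.

```
# postgresql.conf -- select the OS facility used for dynamic shared memory.
# Values discussed in this thread: posix, sysv, mmap, none.
dynamic_shared_memory_type = posix
```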

>> In addition, I've included an implementation based on mmap of a plain
>> file.  As compared with a true shared memory implementation, this
>> obviously has the disadvantage that the OS may be more likely to
>> decide to write back dirty pages to disk, which could hurt
>> performance.  However, I believe it's worthy of inclusion all the
>> same, because there are a variety of situations in which it might be
>> more convenient than one of the other implementations.  One is
>> debugging.
>
> Hm. Not sure what's the advantage over a corefile here.

You can look at it while the server's running.

>> On MacOS X, for example, there seems to be no way to list
>> POSIX shared memory segments, and no easy way to inspect the contents
>> of either POSIX or System V shared memory segments.
>
> Shouldn't we ourselves know which segments are around?

Sure, that's the point of the control segment.  But listing a
directory is a lot easier than figuring out what the current control
segment contents are.


>> Another use case
>> is working around an administrator-imposed or OS-imposed shared memory
>> limit.  If you're not allowed to allocate shared memory, but you are
>> allowed to create files, then this implementation will let you use
>> whatever facilities we build on top of dynamic shared memory anyway.
>
> I don't think we should try to work around limits like that.

I do.  There's probably someone, somewhere in the world who thinks
that operating system shared memory limits are a good idea, but I have
not met any such person.  There are multiple ways to create shared
memory, and they all have different limits.  Normally, System V limits
are small, POSIX limits are large, and the inherited-anonymous-mapping
trick we're now using for the main shared memory segment has no limits
at all.  It's very common to run into a system where you can allocate
many gigabytes of backend-private memory, but if you try to
allocate 64MB of *shared* memory, you get the axe - or maybe not,
depending on which API you use to create it.

I would never advocate deliberately trying to circumvent a
carefully-considered OS-level policy decision about resource
utilization, but I don't think that's the dynamic here.  I think if we
insist on predetermining the dynamic shared memory implementation
based on the OS, we'll just be inconveniencing people needlessly, or
flat-out making things not work.  I think this case is roughly similar
to wal_sync_method: there really shouldn't be a performance or
reliability difference between the ~6 ways of flushing a file to disk,
but as it turns out, there is, so we have an option.  If we're SURE
that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
100% of cases, and that a NetBSD user will always prefer "sysv" over
"mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
But I'm not that sure.

> It wouldn't even work. Several mappings of /dev/zero et al. do *not*
> result in the same virtual memory being mapped. Not even when using the
> same (passed around) fd.
> Believe me, I tried ;)

OK, well that's another reason I didn't do it that way, then.  :-)

> At this point I am rather unconcerned with this point to be
> honest.

I think that's appropriate; mostly, I wanted to emphasize that the
wisdom of allocating any given amount of shared memory is outside the
scope of this patch, which only aims to provide mechanism, not policy.

> Why do we want to expose something unreliable as preferred_address to
> the external interface? I haven't read the code yet, so I might be
> missing something here.

I shared your opinion that preferred_address is never going to be
reliable, although FWIW Noah thinks it can be made reliable with a
large-enough hammer.  But even if it isn't reliable, there doesn't
seem to be all that much value in forbidding access to that part of
the OS-provided API.  In the world where it's not reliable, it may
still be convenient to map things at the same address when you can, so
that raw pointers can be used.  Of course you'd have to have some
fallback strategy for when you don't get the same mapping, and maybe
that's painful enough that there's no point after all.  Or maybe it's
worth having one code path for relativized pointers and another for
non-relativized pointers.

To be honest, I'm not really sure.  I think it's clear enough that this
will meet the minimal requirements for parallel query - ONE dynamic
shared memory segment that's not guaranteed to be at the same address
in every backend, and can't be resized after creation.  And we could
pare the API down to only support that.  But I'd rather get some
experience with this first before we start taking away options.
Otherwise, we may never really find out the limits of what is possible
in this area, and I think that would be a shame.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


