Re: dynamic shared memory - Mailing list pgsql-hackers

From Andres Freund
Subject Re: dynamic shared memory
Date
Msg-id 20130830154539.GK5019@alap2.anarazel.de
In response to Re: dynamic shared memory  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: dynamic shared memory
Re: dynamic shared memory
List pgsql-hackers
Hi,

On 2013-08-28 15:20:57 -0400, Robert Haas wrote:
> > That way any corruption in that area will prevent restarts without
> > reboot unless you use ipcrm, or such, right?
> 
> The way I've designed it, no.  If what we expect to be the control
> segment doesn't exist or doesn't conform to our expectations, we just
> assume that it's not really the control segment after all - e.g.
> someone rebooted, clearing all the segments, and then an unrelated
> process (malicious, perhaps, or just a completely different cluster)
> reused the same name.  This is similar to what we do for the main
> shared memory segment.

The case I am mostly wondering about is some process crashing and
overwriting random memory. We need to be pretty sure that we'll never
fail partially through cleaning up old segments because they are
corrupted or because we died halfway through our last cleanup attempt.
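As a rough illustration of that concern, here is a minimal sketch of validating a control segment before acting on anything it lists. The struct layout, the magic value, and every name below are hypothetical and not taken from the patch; the only point is that each field is bounds-checked before it is used, so a scribbled-over header makes cleanup a no-op rather than a crash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names and values below are hypothetical, for illustration only. */
#define CTL_MAGIC     0x44534d43u   /* arbitrary marker value */
#define CTL_MAX_ITEMS 64

typedef struct
{
    uint32_t handle;    /* identifies one dynamic segment */
    uint32_t refcnt;    /* 0 means the slot is unused */
} ctl_item;

typedef struct
{
    uint32_t magic;
    uint32_t nitems;
    ctl_item item[CTL_MAX_ITEMS];
} ctl_header;

/* Return 1 only if the mapped bytes plausibly are a control segment. */
static int
ctl_is_sane(const ctl_header *hdr, size_t mapped_size)
{
    if (hdr == NULL || mapped_size < sizeof(ctl_header))
        return 0;
    if (hdr->magic != CTL_MAGIC)
        return 0;
    if (hdr->nitems > CTL_MAX_ITEMS)
        return 0;
    return 1;
}

/*
 * Copy the handles of all in-use slots into out[] (at most out_len of
 * them) and return how many were copied.  An untrusted header yields 0,
 * so the caller never walks past the array on corrupted input.
 */
static size_t
ctl_collect_stale(const ctl_header *hdr, size_t mapped_size,
                  uint32_t *out, size_t out_len)
{
    size_t n = 0;

    assert(out != NULL || out_len == 0);
    if (!ctl_is_sane(hdr, mapped_size))
        return 0;
    for (uint32_t i = 0; i < hdr->nitems && n < out_len; i++)
    {
        if (hdr->item[i].refcnt != 0)
            out[n++] = hdr->item[i].handle;
    }
    return n;
}
```

A check like this only covers the header; a real implementation would also have to tolerate each listed segment being missing or already removed, which is the "died halfway through cleanup" case.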

> > I think we want that during development, but I'd rather not go there
> > when releasing. After all, we don't support a manual choice between
> > anonymous mmap/sysv shmem either.

> That's true, but that decision has not been uncontroversial - e.g. the
> NetBSD guys don't like it, because they have a big performance
> difference between those two types of memory.  We have to balance the
> possible harm of one more setting against the benefit of letting
> people do what they want without needing to recompile or modify code.

But then, it made them fix the issue afaik :P

> >> In addition, I've included an implementation based on mmap of a plain
> >> file.  As compared with a true shared memory implementation, this
> >> obviously has the disadvantage that the OS may be more likely to
> >> decide to write back dirty pages to disk, which could hurt
> >> performance.  However, I believe it's worthy of inclusion all the
> >> same, because there are a variety of situations in which it might be
> >> more convenient than one of the other implementations.  One is
> >> debugging.
> >
> > Hm. Not sure what's the advantage over a corefile here.

> You can look at it while the server's running.

That's what debuggers are for.

> >> On MacOS X, for example, there seems to be no way to list
> >> POSIX shared memory segments, and no easy way to inspect the contents
> >> of either POSIX or System V shared memory segments.

> > Shouldn't we ourselves know which segments are around?

> Sure, that's the point of the control segment.  But listing a
> directory is a lot easier than figuring out what the current control
> segment contents are.

But without a good amount of tooling - like in a debugger... - it's not
very interesting to look at those files either way? The mere presence of
a segment doesn't tell you much and the contents won't be easily
readable.

> >> Another use case is working around an administrator-imposed or
> >> OS-imposed shared memory limit.  If you're not allowed to allocate
> >> shared memory, but you are allowed to create files, then this
> >> implementation will let you use whatever facilities we build on top
> >> of dynamic shared memory anyway.
> >
> > I don't think we should try to work around limits like that.

> I do.  There's probably someone, somewhere in the world who thinks
> that operating system shared memory limits are a good idea, but I have
> not met any such person.

"Let's drive users away from sysv shmem" is the only one I heard so far ;)

> I would never advocate deliberately trying to circumvent a
> carefully-considered OS-level policy decision about resource
> utilization, but I don't think that's the dynamic here.  I think if we
> insist on predetermining the dynamic shared memory implementation
> based on the OS, we'll just be inconveniencing people needlessly, or
> flat-out making things not work. [...]

But using file-backed memory will *suck* performance-wise. Why should we
ever want to offer that to a user? That's what I was arguing about
primarily.
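To make concrete what "file-backed" means in this exchange, here is a minimal sketch assuming a POSIX system. The helper name is hypothetical and this is not the code from the patch; it only shows the open/ftruncate/mmap sequence that such an implementation rests on.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or open) a plain file of the given size
 * and map it shared.  Returns NULL on failure.
 */
static void *
map_file_segment(const char *path, size_t size)
{
    void *addr;
    int   fd;

    assert(path != NULL && size > 0);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    /* Size the file so the whole mapping is backed by it. */
    if (ftruncate(fd, (off_t) size) != 0)
    {
        close(fd);
        return NULL;
    }

    /* MAP_SHARED makes stores visible to other mappings of the file. */
    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */

    return addr == MAP_FAILED ? NULL : addr;
}
```

Because the pages belong to a regular file, the kernel is free to write dirty pages back to disk whenever it likes, which is where the performance objection comes from.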

> If we're SURE
> that a Linux user will prefer "posix" to "sysv" or "mmap" or "none" in
> 100% of cases, and that a NetBSD user will always prefer "sysv" over
> "mmap" or "none" in 100% of cases, then, OK, sure, let's bake it in.
> But I'm not that sure.

I think posix shmem will be preferred to sysv shmem if present, in just
about any relevant case. I don't know of any system with lower limits on
posix shmem than on sysv.

> I think this case is roughly similar
> to wal_sync_method: there really shouldn't be a performance or
> reliability difference between the ~6 ways of flushing a file to disk,
> but as it turns out, there is, so we have an option.

Well, most of them actually give different guarantees, so it makes sense
to have differing performance...

> > Why do we want to expose something unreliable as preferred_address to
> > the external interface? I haven't read the code yet, so I might be
> > missing something here.

> I shared your opinion that preferred_address is never going to be
> reliable, although FWIW Noah thinks it can be made reliable with a
> large-enough hammer.

I think we need to have the arguments for that on list then. Those are
pretty damn fundamental design decisions.
I for one cannot see how you could even remotely make that work a) on
Windows (check the trouble we have to go through to get s_b
consistently placed, and that's directly after startup) or b) on 32-bit systems.

> But even if it isn't reliable, there doesn't seem to be all that much
> value in forbidding access to that part of the OS-provided API.  In
> the world where it's not reliable, it may still be convenient to map
> things at the same address when you can, so that pointers can be
> used.  Of course you'd have to have some fallback strategy for when
> you don't get the same mapping, and maybe that's painful enough that
> there's no point after all.  Or maybe it's worth having one code path
> for relativized pointers and another for non-relativized pointers.

It seems likely to me that we will end up with untested code in that
case, or even with unsupported platforms.
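For reference, the "relativized pointer" path mentioned in the quoted text can be sketched as below. The type and function names are hypothetical, and reserving offset 0 as the null value is just one possible convention: what gets stored in the segment is an offset from the segment base, which each backend converts using its own mapping address.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical relative pointer: a byte offset from the start of the
 * segment.  Offset 0 is reserved to mean NULL, so the very first byte
 * of the segment cannot be pointed to.
 */
typedef size_t relptr;

/* Convert an address inside the segment mapped at `base` to an offset. */
static relptr
relptr_store(const char *base, const void *p)
{
    if (p == NULL)
        return 0;
    assert((const char *) p > base);
    return (size_t) ((const char *) p - base);
}

/* Convert an offset back to an address in this process's mapping. */
static void *
relptr_access(char *base, relptr off)
{
    return off == 0 ? NULL : (void *) (base + off);
}
```

The same offset resolves correctly no matter where a given backend has the segment mapped, at the cost of one addition per dereference and of keeping this path separate from any plain-pointer path.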

> To be honest, I'm not real sure.  I think it's clear enough that this
> will meet the minimal requirements for parallel query - ONE dynamic
> shared memory segment that's not guaranteed to be at the same address
> in every backend, and can't be resized after creation.  And we could
> pare the API down to only support that.  But I'd rather get some
> experience with this first before we start taking away options.
> Otherwise, we may never really find out the limits of what is possible
> in this area, and I think that would be a shame.

On the other hand, adding capabilities annoys people far less than
deciding that we can't support them in the end and taking them away.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


