Re: First set of OSDL Shared Mem scalability results, some wierdness ... - Mailing list pgsql-performance

From Kevin Brown
Subject Re: First set of OSDL Shared Mem scalability results, some wierdness ...
Msg-id 20041014202531.GD665@filer
In response to Re: First set of OSDL Shared Mem scalability results, some wierdness ...  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: First set of OSDL Shared Mem scalability results, some wierdness ...  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Tom Lane wrote:
> >> mmap() is Right Out because it does not afford us sufficient control
> >> over when changes to the in-memory data will propagate to disk.
>
> > ... that's especially true if we simply cannot
> > have the page written to disk in a partially-modified state (something
> > I can easily see being an issue for the WAL -- would the same hold
> > true of the index/data files?).
>
> You're almost there.  Remember the fundamental WAL rule: log entries
> must hit disk before the data changes they describe.  That means that we
> need not only a way of forcing changes to disk (fsync) but a way of
> being sure that changes have *not* gone to disk yet.  In the existing
> implementation we get that by just not issuing write() for a given page
> until we know that the relevant WAL log entries are fsync'd down to
> disk.  (BTW, this is what the LSN field on every page is for: it tells
> the buffer manager the latest WAL offset that has to be flushed before
> it can safely write the page.)
>
> mmap provides msync which is comparable to fsync, but AFAICS it
> provides no way to prevent an in-memory change from reaching disk too
> soon.  This would mean that WAL entries would have to be written *and
> flushed* before we could make the data change at all, which would
> convert multiple updates of a single page into a series of write-and-
> wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per transaction
> is bad enough, once per atomic action is intolerable.

Hmm...something just occurred to me about this.

Would a hybrid approach be possible?  That is, use mmap() to handle
reads, and use write() to handle writes?

Any code that wishes to write to a page would have to recognize that
it's doing so and fetch a copy from the storage manager (or
something), which would look to see if the page already exists as a
writeable buffer.  If it doesn't, it creates it by allocating the
memory and then copying the page from the mmap()ed area to the new
buffer, and returning it.  If it does, it just returns a pointer to
the buffer.  There would obviously have to be some bookkeeping
involved: the storage manager would have to know how to map a mmap()ed
page back to a writeable buffer and vice-versa, so that once it
decides to write the buffer it can determine which page in the
original file the buffer corresponds to (so it can do the appropriate
lseek()).
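
A rough sketch of that bookkeeping, to make the idea concrete.
Everything here is an assumption for illustration: the names
PageTable, get_writable_page, get_readable_page and flush_page are
made up, the page size and table size are arbitrary, and map_base is
assumed to point at a read-only mmap() of the file.  It ignores
locking and the WAL check entirely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ   8192   /* assumed page size */
#define MAX_PAGES 64     /* assumed: small fixed table for the sketch */

/* Hypothetical bookkeeping: one slot per page of the file. */
typedef struct {
    const char *map_base;       /* read-only mmap()ed view of the file */
    char *wbuf[MAX_PAGES];      /* private writable copy, or NULL */
} PageTable;

/* Return the writable buffer for page_no, creating it on first use by
 * copying the page out of the mapped area.  Returns NULL if out of
 * memory. */
static char *get_writable_page(PageTable *pt, size_t page_no)
{
    assert(page_no < MAX_PAGES);
    if (pt->wbuf[page_no] == NULL) {
        char *buf = malloc(PAGE_SZ);
        if (buf == NULL)
            return NULL;
        memcpy(buf, pt->map_base + page_no * PAGE_SZ, PAGE_SZ);
        pt->wbuf[page_no] = buf;
    }
    return pt->wbuf[page_no];
}

/* Readers use the private copy if one exists, else the mapping. */
static const char *get_readable_page(const PageTable *pt, size_t page_no)
{
    assert(page_no < MAX_PAGES);
    return pt->wbuf[page_no] != NULL
        ? pt->wbuf[page_no]
        : pt->map_base + page_no * PAGE_SZ;
}

/* Write the private copy back to its place in the file through an fd
 * opened for writing.  Returns 0 on success, -1 on error. */
static int flush_page(const PageTable *pt, int fd, size_t page_no)
{
    assert(page_no < MAX_PAGES);
    if (pt->wbuf[page_no] == NULL)
        return 0;                       /* nothing to write */
    if (lseek(fd, (off_t)(page_no * PAGE_SZ), SEEK_SET) == (off_t)-1)
        return -1;
    if (write(fd, pt->wbuf[page_no], PAGE_SZ) != PAGE_SZ)
        return -1;
    return 0;
}
```

The point of routing all writes through flush_page() is that the
caller still decides when write() is issued, so the WAL ordering rule
can be enforced there exactly as it is today.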

In a write-heavy database, you'll end up with a lot of memory copy
operations, but with the scheme we currently use you get that anyway
(it just happens in kernel code instead of user code), so I don't see
that as much of a loss, if any.  Where you win is in a read-heavy
database: you end up being able to read directly from the pages in the
kernel's page cache and thus save a memory copy from kernel space to
user space, not to mention the context switch that happens due to
issuing the read().
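
To illustrate the difference in the read path, here is a small sketch
that sums the bytes of a file both ways.  The function names are made
up for the example.  The read() version copies every block into a
user-space buffer first; the mmap() version looks at the mapped pages
directly.  Note that the mmap() version still takes page faults on
first touch, so how much is actually saved would have to be measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum the file's bytes via read(): each call is a system call that
 * copies data from the kernel into buf.  Returns -1 on error. */
static long sum_with_read(const char *path)
{
    unsigned char buf[8192];
    long sum = 0;
    ssize_t n;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
    close(fd);
    return n < 0 ? -1 : sum;
}

/* Sum the file's bytes via a read-only mapping: no user-space buffer
 * and no per-block system call.  Returns -1 on error. */
static long sum_with_mmap(const char *path)
{
    struct stat st;
    const unsigned char *p;
    long sum = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap() rejects length 0 */
        close(fd);
        return 0;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    for (off_t i = 0; i < st.st_size; i++)
        sum += p[i];
    munmap((void *)p, (size_t)st.st_size);
    return sum;
}
```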


Obviously you'd want to mmap() the file read-only in order to prevent
the issues you mention regarding an errant backend, and then reopen
the file read-write for the purpose of writing to it.  In fact, you
could decouple the two: mmap() the file, then close the file -- the
mmap()ed region will remain mapped.  Then, as long as the file remains
mapped, you need to open the file again only when you want to write to
it.
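
A minimal sketch of that decoupling, with made-up function names
(map_readonly_and_close and write_at are not from any existing code).
The mapping is created PROT_READ so a stray store through it faults
instead of corrupting the file, and the descriptor is closed
immediately; a separate descriptor is opened only for the duration of
a write.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map the whole file read-only, then close the descriptor at once.
 * The mapping stays valid until munmap().  Returns NULL on failure
 * (including an empty file, which cannot be mapped). */
static const char *map_readonly_and_close(const char *path, size_t *len_out)
{
    struct stat st;
    void *p;
    int fd;

    assert(path != NULL && len_out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}

/* Open the file again just long enough to write len bytes at offset.
 * Returns 0 on success, -1 on error or short write. */
static int write_at(const char *path, off_t offset, const void *buf, size_t len)
{
    ssize_t n;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    n = pwrite(fd, buf, len, offset);
    if (close(fd) < 0)
        return -1;
    return n == (ssize_t)len ? 0 : -1;
}
```

In practice one would presumably keep the write descriptor open rather
than reopening it for each write; the sketch only shows that the two
are independent.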


--
Kevin Brown                          kevin@sysexperts.com
