Re: First set of OSDL Shared Mem scalability results, some wierdness ... - Mailing list pgsql-performance

From: Tom Lane
Subject: Re: First set of OSDL Shared Mem scalability results, some wierdness ...
Msg-id: 4859.1097363137@sss.pgh.pa.us
In response to: Re: First set of OSDL Shared Mem scalability results, some wierdness ... (Kevin Brown <kevin@sysexperts.com>)
Responses: Re: First set of OSDL Shared Mem scalability results, some wierdness ... (Kevin Brown <kevin@sysexperts.com>)
           Re: First set of OSDL Shared Mem scalability results, some (Curt Sampson <cjs@cynic.net>)
List: pgsql-performance

Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.

> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

You're almost there.  Remember the fundamental WAL rule: log entries
must hit disk before the data changes they describe.  That means that we
need not only a way of forcing changes to disk (fsync) but a way of
being sure that changes have *not* gone to disk yet.  In the existing
implementation we get that by just not issuing write() for a given page
until we know that the relevant WAL log entries are fsync'd down to
disk.  (BTW, this is what the LSN field on every page is for: it tells
the buffer manager the latest WAL offset that has to be flushed before
it can safely write the page.)
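In pseudo-C, the rule looks about like this (a toy sketch with invented
names, not the actual buffer-manager code):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;            /* byte offset into the WAL stream */

typedef struct Page {
    XLogRecPtr lsn;                     /* WAL position of the latest change to this page */
} Page;

static XLogRecPtr wal_flushed_upto = 0; /* WAL known to be fsync'd through here */

/* Stand-in for write()+fsync() on the WAL file. */
static void flush_wal_upto(XLogRecPtr upto)
{
    if (upto > wal_flushed_upto)
        wal_flushed_upto = upto;
    printf("WAL fsync'd through %llu\n", (unsigned long long) wal_flushed_upto);
}

/* The fundamental rule, log before data: a dirty page may not be
 * written until the WAL record for its latest change is safely down. */
static void flush_dirty_page(Page *page)
{
    if (page->lsn > wal_flushed_upto)
        flush_wal_upto(page->lsn);
    /* ...now it is safe to issue write() for the page... */
    printf("page with LSN %llu written\n", (unsigned long long) page->lsn);
}

int main(void)
{
    Page p = { 42 };
    flush_dirty_page(&p);
    return 0;
}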

mmap provides msync, which is comparable to fsync, but AFAICS it
provides no way to prevent an in-memory change from reaching disk too
soon.  This would mean that WAL entries would have to be written *and
flushed* before we could make the data change at all, which would
convert multiple updates of a single page into a series of write-and-
wait-for-WAL-fsync steps.  Not good.  fsync'ing WAL once per transaction
is bad enough; once per atomic action is intolerable.
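Concretely, with mmap the best you can do is something like this
(minimal sketch, error handling omitted):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    int fd = open("datafile", O_RDWR | O_CREAT, 0600);
    ftruncate(fd, 8192);

    char *page = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    memcpy(page, "new data", 8);  /* kernel may flush this at ANY time */

    msync(page, 8192, MS_SYNC);   /* we can force it out... */
    /* ...but there is no call that holds it back until WAL is safe */

    munmap(page, 8192);
    close(fd);
    return 0;
}

msync gives you the "push to disk" half of the contract; nothing in the
mmap family gives you the "don't push yet" half.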

There is another reason for doing things this way.  Consider a backend
that goes haywire and scribbles all over shared memory before crashing.
When the postmaster sees the abnormal child termination, it forcibly
kills the other active backends and discards shared memory altogether.
This gives us fairly good odds that the crash did not affect any data on
disk.  It's not perfect, of course, since another backend might have been
in the process of issuing a write() when the disaster happened, but it's
pretty good; and I think that isolation has a lot to do with PG's
good reputation for not corrupting data in crashes.  If we had a large
fraction of the address space mmap'd then this sort of crash would be
just about guaranteed to propagate corruption into the on-disk files.
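Roughly the detection pattern, as a toy sketch rather than the real
postmaster logic:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        abort();                /* "backend" goes haywire, dies on a signal */

    int status;
    waitpid(pid, &status, 0);

    if (WIFSIGNALED(status)) {
        /* Abnormal exit: assume shared memory is suspect.  The real
         * postmaster kills the remaining backends and reinitializes
         * shared memory, so nothing from the damaged segment is
         * written to disk. */
        printf("child killed by signal %d; discarding shared state\n",
               WTERMSIG(status));
    }
    return 0;
}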

            regards, tom lane
